Patch series "Provide generic top-down mmap layout functions", v6.
This series introduces generic functions to make top-down mmap layout
easily accessible to architectures, in particular riscv which was the
initial goal of this series. The generic implementation was taken from
arm64 and used successively by arm, mips and finally riscv.
Note that in addition the series fixes 2 issues:
- stack randomization was taken into account even if not necessary.
- [1] fixed an issue where the mmap base did not take randomization into
account, but the fix was not ported to arm and mips; by moving the arm64
code into a generic library, the problem is now fixed for both
architectures.
This work is an effort to factorize architecture functions to avoid code
duplication and oversights as in [1].
[1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1429066.html
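As a rough illustration of the two fixes, here is a minimal sketch of a
top-down mmap base computation. All SKETCH_* constants and sketch_* names
are hypothetical stand-ins, not the kernel's definitions, and real values
are per-architecture. The stack randomization padding is only counted when
randomization is enabled, and the mmap random offset is subtracted from
the base.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants for illustration; assumes a 64-bit unsigned long. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_STACK_TOP 0x0000800000000000UL
#define SKETCH_MIN_GAP   (128UL << 20)              /* 128 MiB */
#define SKETCH_MAX_GAP   (SKETCH_STACK_TOP / 6 * 5)

static unsigned long sketch_page_align(unsigned long addr)
{
	return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * stack_rlimit: maximum stack size
 * guard_gap:    stack guard gap
 * stack_rnd:    maximum stack randomization, counted only if 'randomize'
 * mmap_rnd:     random offset applied to the mmap base itself
 */
static unsigned long sketch_mmap_base(unsigned long stack_rlimit,
				      unsigned long guard_gap,
				      unsigned long stack_rnd,
				      unsigned long mmap_rnd,
				      bool randomize)
{
	unsigned long gap = stack_rlimit;
	unsigned long pad = guard_gap;

	/* Account for stack randomization only when it is in effect. */
	if (randomize)
		pad += stack_rnd;

	/* Skip the padding if adding it would overflow. */
	if (gap + pad > gap)
		gap += pad;

	if (gap < SKETCH_MIN_GAP)
		gap = SKETCH_MIN_GAP;
	else if (gap > SKETCH_MAX_GAP)
		gap = SKETCH_MAX_GAP;

	return sketch_page_align(SKETCH_STACK_TOP - gap - mmap_rnd);
}
```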
This patch (of 14):
This preparatory commit moves this function so that further introduction
of generic topdown mmap layout is contained only in mm/util.c.
Link: http://lkml.kernel.org/r/20190730055113.23635-2-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
khugepaged needs exclusive mmap_sem to access page table. When it fails
to lock mmap_sem, the page will fault in as pte-mapped THP. As the page
is already a THP, khugepaged will not handle this pmd again.
This patch enables khugepaged to retry collapsing the page table.
struct mm_slot (in khugepaged.c) is extended with an array, containing
addresses of pte-mapped THPs. We use array here for simplicity. We can
easily replace it with more advanced data structures when needed.
In khugepaged_scan_mm_slot(), if the mm contains pte-mapped THP, we try to
collapse the page table.
Since collapse may happen at a later time, some pages may already have
faulted in. collapse_pte_mapped_thp() is added to properly handle these pages.
collapse_pte_mapped_thp() also double checks whether all ptes in this pmd
are mapping to the same THP. This is necessary because some subpage of
the THP may be replaced, for example by uprobe. In such cases, it is not
possible to collapse the pmd.
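A minimal sketch of the two ideas above, under stated assumptions: the
struct layout, the capacity of 8, the 512 ptes per pmd and every sketch_*
name are hypothetical, and ptes are modelled as plain pfn values with 0
meaning "not faulted in". It shows a fixed-size array of addresses
attached to the slot, and the check that every populated pte maps the
matching subpage of one THP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PTE_MAPPED_THP 8   /* hypothetical array capacity */
#define SKETCH_PTRS_PER_PMD       512 /* assumes 4K pages and 2M THP */

struct sketch_mm_slot {
	int nr_pte_mapped_thp;
	unsigned long pte_mapped_thp[SKETCH_MAX_PTE_MAPPED_THP];
};

/* Remember an address for a later collapse attempt; false if full. */
static bool sketch_add_pte_mapped_thp(struct sketch_mm_slot *slot,
				      unsigned long addr)
{
	if (slot->nr_pte_mapped_thp >= SKETCH_MAX_PTE_MAPPED_THP)
		return false;
	slot->pte_mapped_thp[slot->nr_pte_mapped_thp++] = addr;
	return true;
}

/*
 * ptes[i] is the pfn mapped by the i-th pte, or 0 for an empty pte.
 * head_pfn must be non-zero. The pmd can be collapsed only if every
 * populated pte maps subpage i of the THP starting at head_pfn.
 */
static bool sketch_can_collapse(const unsigned long *ptes,
				unsigned long head_pfn)
{
	for (size_t i = 0; i < SKETCH_PTRS_PER_PMD; i++) {
		if (ptes[i] == 0)
			continue;	/* not faulted in yet */
		if (ptes[i] != head_pfn + i)
			return false;	/* subpage replaced, e.g. by uprobe */
	}
	return true;
}
```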
[kirill.shutemov@linux.intel.com: add comments for retract_page_tables()]
Link: http://lkml.kernel.org/r/20190816145443.6ard3iilytc6jlgv@box
Link: http://lkml.kernel.org/r/20190815164525.1848545-6-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the THP deferred split shrinker is not memcg aware, which may
cause premature OOM with some configurations. For example, the test below
runs into premature OOM easily:
$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000
transhuge-stress comes from kernel selftest.
It is easy to hit OOM while there are still a lot of THPs on the deferred
split queue; memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.
Make the deferred split shrinker memcg aware by introducing a per-memcg
deferred split queue. A THP is on the per-memcg deferred split queue if it
belongs to a memcg, and on the per-node queue otherwise. When the page is
migrated to another memcg, it is moved to the target memcg's deferred
split queue too.
Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.
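The queue selection described above can be sketched as follows. This is
only an illustration: the sketch_* types are hypothetical stand-ins, the
queues are reduced to a length counter, and locking is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_split_queue {
	unsigned long len;	/* number of THPs queued */
};

struct sketch_memcg { struct sketch_split_queue deferred_split_queue; };
struct sketch_node  { struct sketch_split_queue deferred_split_queue; };

struct sketch_thp {
	struct sketch_memcg *memcg;	/* NULL if not charged to a memcg */
	struct sketch_node *node;
};

/* A THP uses its memcg's queue if it has one, else the per-node queue. */
static struct sketch_split_queue *sketch_get_split_queue(struct sketch_thp *thp)
{
	if (thp->memcg)
		return &thp->memcg->deferred_split_queue;
	return &thp->node->deferred_split_queue;
}

/*
 * Move a queued THP to another memcg: it leaves its current queue and
 * joins the target memcg's queue, so it is never on two queues at once.
 */
static void sketch_move_to_memcg(struct sketch_thp *thp,
				 struct sketch_memcg *to)
{
	sketch_get_split_queue(thp)->len--;
	thp->memcg = to;
	sketch_get_split_queue(thp)->len++;
}
```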
[yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Make deferred split shrinker memcg aware", v6.
Currently the THP deferred split shrinker is not memcg aware, which may
cause premature OOM with some configurations. For example, the test below
runs into premature OOM easily:
$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000
transhuge-stress comes from kernel selftest.
It is easy to hit OOM while there are still a lot of THPs on the deferred
split queue; memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.
Make the deferred split shrinker memcg aware by introducing a per-memcg
deferred split queue. A THP is on the per-memcg deferred split queue if it
belongs to a memcg, and on the per-node queue otherwise. When the page is
migrated to another memcg, it is moved to the target memcg's deferred
split queue too.
Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.
Make deferred split shrinker not depend on memcg kmem since it is not
slab. It doesn't make sense to not shrink THP even though memcg kmem is
disabled.
With the above change, the test demonstrated above doesn't trigger OOM
even with cgroup.memory=nokmem.
This patch (of 4):
Put split_queue, split_queue_lock and split_queue_len into a struct in
order to reduce code duplication when we convert deferred_split to memcg
aware in the later patches.
Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz reports that "hugetlb allocations could stall for minutes or
hours when should_compact_retry() would return true more often then it
should. Specifically, this was in the case where compact_result was
COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
made."
The problem is that the compaction_withdrawn() test in
should_compact_retry() includes compaction outcomes that are only possible
on low compaction priority, and results in a retry without increasing the
priority. This may result in further reclaim, and more incomplete
compaction attempts.
With this patch, compaction priority is raised when possible, or
should_compact_retry() returns false.
The COMPACT_SKIPPED result doesn't really fit together with the other
outcomes in compaction_withdrawn(), as that's a result caused by
insufficient order-0 pages, not due to low compaction priority. With this
patch, it is moved to a new compaction_needs_reclaim() function, and for
that outcome we keep the current logic of retrying if it looks like
reclaim will be able to help.
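The resulting decision can be sketched roughly as below. This is a
simplified model, not the kernel function: the enums and sketch_* names
are stand-ins, only the outcomes named in this message are modelled, the
"reclaim can help" test is reduced to a flag, and retry counting is left
out. Lower priority values mean higher priority, so raising the priority
is a decrement.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_compact_result {
	SKETCH_COMPACT_SKIPPED,		/* not enough order-0 pages */
	SKETCH_COMPACT_DEFERRED,
	SKETCH_COMPACT_PARTIAL_SKIPPED,
	SKETCH_COMPACT_COMPLETE,
};

enum sketch_compact_priority {
	SKETCH_PRIO_SYNC_FULL,		/* highest priority */
	SKETCH_PRIO_SYNC_LIGHT,
	SKETCH_PRIO_ASYNC,		/* lowest priority */
};

static bool sketch_needs_reclaim(enum sketch_compact_result r)
{
	return r == SKETCH_COMPACT_SKIPPED;
}

/* Outcomes that are only possible at low compaction priority. */
static bool sketch_withdrawn(enum sketch_compact_result r)
{
	return r == SKETCH_COMPACT_DEFERRED ||
	       r == SKETCH_COMPACT_PARTIAL_SKIPPED;
}

static bool sketch_should_compact_retry(enum sketch_compact_result r,
					enum sketch_compact_priority *prio,
					bool reclaim_can_help)
{
	/* Too few free pages: retry only if reclaim looks useful. */
	if (sketch_needs_reclaim(r))
		return reclaim_can_help;

	/* Withdrawn: raise the priority if possible, otherwise give up. */
	if (sketch_withdrawn(r)) {
		if (*prio > SKETCH_PRIO_SYNC_FULL) {
			(*prio)--;
			return true;
		}
		return false;
	}

	return false;
}
```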
Link: http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com
Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Objective
---------
The current implementation of struct vmap_area wasted space.
After applying this commit, sizeof(struct vmap_area) has been
reduced from 11 words to 8 words.
Description
-----------
1) Pack "subtree_max_size", "vm" and "purge_list". This is no problem
because
A) "subtree_max_size" is only used when vmap_area is in "free" tree
B) "vm" is only used when vmap_area is in "busy" tree
C) "purge_list" is only used when vmap_area is in vmap_purge_list
2) Eliminate "flags".
Since only one flag, VM_VM_AREA, is in use, and the same information can be
obtained by checking whether "vm" is NULL, "flags" can be eliminated.
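The layout change can be illustrated with stand-in types. The sketch_*
structs below are hypothetical models of the kernel types, sized like
them only under the assumption that unsigned long and pointers have the
same size (as on Linux); the field order is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel types (assumed sizes: 3, 2 and 1 words). */
struct sketch_rb_node    { unsigned long parent_color; void *right, *left; };
struct sketch_list_head  { void *next, *prev; };
struct sketch_llist_node { void *next; };
struct sketch_vm_struct;	/* opaque */

/* Before: every field has its own slot, 11 words in total. */
struct sketch_vmap_area_old {
	unsigned long va_start;
	unsigned long va_end;
	unsigned long subtree_max_size;
	unsigned long flags;
	struct sketch_rb_node rb_node;
	struct sketch_list_head list;
	struct sketch_llist_node purge_list;
	struct sketch_vm_struct *vm;
};

/* After: mutually exclusive fields share one word, "flags" is gone. */
struct sketch_vmap_area_new {
	unsigned long va_start;
	unsigned long va_end;
	struct sketch_rb_node rb_node;
	struct sketch_list_head list;
	union {
		unsigned long subtree_max_size;		/* in "free" tree */
		struct sketch_vm_struct *vm;		/* in "busy" tree */
		struct sketch_llist_node purge_list;	/* in purge list */
	};
};

static_assert(sizeof(struct sketch_vmap_area_old) == 11 * sizeof(void *),
	      "old layout is 11 words");
static_assert(sizeof(struct sketch_vmap_area_new) == 8 * sizeof(void *),
	      "new layout is 8 words");
```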
Link: http://lkml.kernel.org/r/20190716152656.12255-3-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each memory block spans the same amount of sections/pages/bytes. The size
is determined before the first memory block is created. No need to store
what we can easily calculate - and the calculations even look simpler now.
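As a sketch of the kind of calculation meant here, assuming a fixed block
size and a hypothetical 128 MiB section size (SKETCH_SECTION_SIZE_BITS
and the sketch_* helpers are illustrative names, not kernel symbols),
per-block values such as the last section number follow directly from
the first one:

```c
#include <assert.h>

#define SKETCH_SECTION_SIZE_BITS 27UL	/* hypothetical: 128 MiB sections */

/* Number of sections in one memory block of the given size. */
static unsigned long sketch_sections_per_block(unsigned long block_size_bytes)
{
	return block_size_bytes >> SKETCH_SECTION_SIZE_BITS;
}

/* Block that a given section belongs to. */
static unsigned long sketch_block_id(unsigned long section_nr,
				     unsigned long sections_per_block)
{
	return section_nr / sections_per_block;
}

/* Last section of a block, derived instead of stored. */
static unsigned long sketch_block_end_section(unsigned long start_section_nr,
					      unsigned long sections_per_block)
{
	return start_section_nr + sections_per_block - 1;
}
```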
Michal brought up the idea of variable-sized memory blocks. However, if
we ever implement something like this, we will need an API compatibility
switch and reworks at various places (most code assumes a fixed memory
block size). So let's clean up what we have right now.
While at it, fix the variable naming in register_mem_sect_under_node() -
we no longer talk about a single section.
Link: http://lkml.kernel.org/r/20190809110200.2746-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's validate the memory block size early, when initializing the memory
device infrastructure. Fail hard in case the value is not suitable.
As nobody checks the return value of memory_dev_init(), turn it into a
void function and fail with a panic in all scenarios instead. Otherwise,
we'll crash later during boot when core/drivers expect that the memory
device infrastructure (including memory_block_size_bytes()) works as
expected.
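A minimal sketch of such an early check follows. The criteria used here
(a power of two that is at least one section in size) are an assumption
made for illustration, as are SKETCH_MIN_BLOCK_SIZE and the sketch_*
names; the message above only says the value must be suitable.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MIN_BLOCK_SIZE (1UL << 27)	/* hypothetical: one 128 MiB section */

static bool sketch_is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Assumed validity rule: power of two and not smaller than a section.
 * A caller in early init would fail hard when this returns false.
 */
static bool sketch_block_size_valid(unsigned long block_size)
{
	return sketch_is_power_of_2(block_size) &&
	       block_size >= SKETCH_MIN_BLOCK_SIZE;
}
```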
I think long term, we should move the whole memory block size
configuration (set_memory_block_size_order() and
memory_block_size_bytes()) into drivers/base/memory.c.
Link: http://lkml.kernel.org/r/20190806090142.22709-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: remove quicklist page table caches".
A while ago Nicholas proposed to remove quicklist page table caches [1].
I've rebased his patch on the current upstream and switched ia64 and sh to
use generic versions of PTE allocation.
[1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com
This patch (of 3):
Remove page table allocator "quicklists". These have been around for a
long time, but have not got much traction in the last decade and are only
used on ia64 and sh architectures.
The numbers in the initial commit look interesting but probably don't
apply anymore. If anybody wants to resurrect this it's in the git
history, but it's unhelpful to have this code and divergent allocator
behaviour for minor archs.
Also it might be better to instead make more general improvements to page
allocator if this is still so slow.
Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()",
v3.
There are about 50+ patches in my tree [2], and I'll be sending out the
remaining ones in a few more groups:
* The block/bio related changes (Jerome mostly wrote those, but I've had
to move stuff around extensively, and add a little code)
* mm/ changes
* other subsystem patches
* an RFC that shows the current state of the tracking patch set. That
can only be applied after all call sites are converted, but it's good to
get an early look at it.
This is part of a tree-wide conversion, as described in fc1d8e7cca ("mm:
introduce put_user_page*(), placeholder versions").
This patch (of 3):
Provide a more capable variation of put_user_pages_dirty_lock(), and delete
put_user_pages_dirty(). This is based on the following:
1. Lots of call sites become simpler if a bool is passed into
put_user_page*(), instead of making the call site choose which
put_user_page*() variant to call.
2. Christoph Hellwig's observation that set_page_dirty_lock() is
usually correct, and set_page_dirty() is usually a bug, or at least
questionable, within a put_user_page*() calling chain.
This leads to the following API choices:
* put_user_pages_dirty_lock(page, npages, make_dirty)
* There is no put_user_pages_dirty(). You have to
hand code that, in the rare case that it's
required.
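The shape of the new API can be sketched with mock types. Everything
below is a stand-in (struct sketch_page and the sketch_* helpers model
only a dirty bit and a reference count); it shows how a single bool
replaces the choice between variants at the call site.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock page: just enough state to show the behaviour. */
struct sketch_page {
	bool dirty;
	int refcount;
};

static void sketch_set_page_dirty_lock(struct sketch_page *page)
{
	page->dirty = true;
}

static void sketch_put_user_page(struct sketch_page *page)
{
	page->refcount--;
}

/* Release every page, dirtying it first only when make_dirty is set. */
static void sketch_put_user_pages_dirty_lock(struct sketch_page **pages,
					     size_t npages, bool make_dirty)
{
	for (size_t i = 0; i < npages; i++) {
		if (make_dirty && !pages[i]->dirty)
			sketch_set_page_dirty_lock(pages[i]);
		sketch_put_user_page(pages[i]);
	}
}
```

A call site that used to branch between two functions can then pass its
condition straight through as the make_dirty argument.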
[jhubbard@nvidia.com: remove unused variable in siw_free_plist()]
Link: http://lkml.kernel.org/r/20190729074306.10368-1-jhubbard@nvidia.com
Link: http://lkml.kernel.org/r/20190724044537.10458-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For debugging purposes it might be useful to keep the owner info even
after page has been freed, and include it in e.g. dump_page() when
detecting a bad page state. For that, change the PAGE_EXT_OWNER flag
meaning to "page owner info has been set at least once" and add new
PAGE_EXT_OWNER_ACTIVE for tracking whether page is supposed to be
currently tracked allocated or free. Adjust dump_page() accordingly,
distinguishing free and allocated pages. In the page_owner debugfs file,
keep printing only allocated pages so that existing scripts are not
confused, and also because free pages are irrelevant for the memory
statistics or leak detection that's the typical use case of the file,
anyway.
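The two-flag scheme can be sketched as below. The SKETCH_* flag values
and helper names are hypothetical; only the meaning of the two flags is
taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	SKETCH_PAGE_EXT_OWNER        = 1u << 0,	/* owner info set at least once */
	SKETCH_PAGE_EXT_OWNER_ACTIVE = 1u << 1,	/* page is currently allocated */
};

struct sketch_page_ext {
	unsigned int flags;
};

static void sketch_on_alloc(struct sketch_page_ext *ext)
{
	ext->flags |= SKETCH_PAGE_EXT_OWNER | SKETCH_PAGE_EXT_OWNER_ACTIVE;
}

/* Freeing keeps the owner info and only clears the "active" state. */
static void sketch_on_free(struct sketch_page_ext *ext)
{
	ext->flags &= ~SKETCH_PAGE_EXT_OWNER_ACTIVE;
}

/* dump_page()-style users can print owner info for freed pages too. */
static bool sketch_has_owner_info(const struct sketch_page_ext *ext)
{
	return (ext->flags & SKETCH_PAGE_EXT_OWNER) != 0;
}

/* The debugfs file keeps listing only pages that are allocated now. */
static bool sketch_listed_in_debugfs(const struct sketch_page_ext *ext)
{
	return (ext->flags & SKETCH_PAGE_EXT_OWNER_ACTIVE) != 0;
}
```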
Link: http://lkml.kernel.org/r/20190820131828.22684-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Hyper-V updates from Sasha Levin:
- first round of vmbus hibernation support (Dexuan Cui)
- remove dependencies on PAGE_SIZE (Maya Nakamura)
- move the hyper-v tools/ code into the tools build system (Andy
Shevchenko)
- hyper-v balloon cleanups (Dexuan Cui)
* tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Drivers: hv: vmbus: Resume after fixing up old primary channels
Drivers: hv: vmbus: Suspend after cleaning up hv_sock and sub channels
Drivers: hv: vmbus: Clean up hv_sock channels by force upon suspend
Drivers: hv: vmbus: Suspend/resume the vmbus itself for hibernation
Drivers: hv: vmbus: Ignore the offers when resuming from hibernation
Drivers: hv: vmbus: Implement suspend/resume for VSC drivers for hibernation
Drivers: hv: vmbus: Add a helper function is_sub_channel()
Drivers: hv: vmbus: Suspend/resume the synic for hibernation
Drivers: hv: vmbus: Break out synic enable and disable operations
HID: hv: Remove dependencies on PAGE_SIZE for ring buffer
Tools: hv: move to tools buildsystem
hv_balloon: Reorganize the probe function
hv_balloon: Use a static page for the balloon_up send buffer
With a PFN_MODE_PMEM namespace, the memmap area is allocated from the device
area. Some architectures map the memmap area with a large page size. On
architectures like ppc64, a 16MB page used for the memmap mapping can map
262144 pfns, which covers a namespace size of 16G.
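The numbers above can be reproduced with a short calculation. This
assumes a 64-byte struct page and the 64K PAGE_SIZE shown in the oops
below; both constants are hard-coded here for illustration and the
SKETCH_* and sketch_* names are hypothetical.

```c
#include <assert.h>

#define SKETCH_STRUCT_PAGE_SIZE 64ULL		/* assumed sizeof(struct page) */
#define SKETCH_PAGE_SIZE        (64ULL << 10)	/* 64K base pages */
#define SKETCH_MEMMAP_PAGE_SIZE (16ULL << 20)	/* 16MB memmap mapping */

/* Number of struct page entries (pfns) held by one 16MB memmap page. */
static unsigned long long sketch_pfns_per_memmap_page(void)
{
	return SKETCH_MEMMAP_PAGE_SIZE / SKETCH_STRUCT_PAGE_SIZE;
}

/* Amount of namespace memory described by one 16MB memmap page. */
static unsigned long long sketch_bytes_covered(void)
{
	return sketch_pfns_per_memmap_page() * SKETCH_PAGE_SIZE;
}
```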
When populating memmap region with 16MB page from the device area,
make sure the allocated space is not used to map resources outside this
namespace. Such usage of device area will prevent a namespace destroy.
Add the resource end pfn to altmap and use it to check whether the memmap area
allocation can map pfns outside the namespace. On ppc64, in such a case we fall back
to allocation from memory.
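The boundary check can be sketched as follows. The function and
parameter names are hypothetical, and the range is treated as inclusive
on both ends purely for illustration; the idea is that a memmap block is
taken from the device area only if all the pfns it describes lie inside
the namespace.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * first_pfn: first pfn described by the candidate memmap block
 * nr_pfns:   number of pfns the block describes (must be > 0)
 * base_pfn:  first pfn of the namespace
 * end_pfn:   last pfn of the namespace (inclusive in this sketch)
 */
static bool sketch_memmap_block_in_namespace(unsigned long first_pfn,
					     unsigned long nr_pfns,
					     unsigned long base_pfn,
					     unsigned long end_pfn)
{
	unsigned long last_pfn = first_pfn + nr_pfns - 1;

	return first_pfn >= base_pfn && last_pfn <= end_pfn;
}
```

When the check fails, the caller would fall back to allocating the
memmap from regular memory instead of the device area.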
This fixes the kernel crash reported below:
[ 132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
[ 133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
[ 133.464760] Faulting instruction address: 0xc00000000007580c
[ 133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
[ 133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
.....
[ 133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
[ 133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
[ 133.464910] Call Trace:
[ 133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
[ 133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
[ 133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
[ 133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
[ 133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
[ 133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
[ 133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
[ 133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
[ 133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
[ 133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
[ 133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
[ 133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
[ 133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
[ 133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250
djbw: Aneesh notes that this crash can likely be triggered in any kernel that
supports 'papr_scm', so flagging that commit for -stable consideration.
Fixes: b5beae5e22 ("powerpc/pseries: Add driver for PAPR SCM regions")
Cc: <stable@vger.kernel.org>
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Santosh Sivaraj <santosh@fossix.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Link: https://lore.kernel.org/r/20190910062826.10041-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Allow the arch to provide the supported alignments and use hugepage alignment only
if we support hugepages. Right now we depend on compile time configs, whereas this
patch switches to runtime discovery.
Architectures like ppc64 can have THP enabled in code, but then can have
hugepage size disabled by the hypervisor. This allows us to create dax devices
with PAGE_SIZE alignment in this case.
Existing dax namespace with alignment larger than PAGE_SIZE will fail to
initialize in this specific case. We still allow fsdax namespace initialization.
With respect to identifying whether to enable hugepage fault for a dax device,
if THP is enabled during compile, we default to taking hugepage fault and in dax
fault handler if we find the fault size > alignment we retry with PAGE_SIZE
fault size.
This also addresses the below failure scenario on ppc64
ndctl create-namespace --mode=devdax | grep align
"align":16777216,
"align":16777216
cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
65536 16777216
daxio.static-debug -z -o /dev/dax0.0
Bus error (core dumped)
$ dmesg | tail
lpar: Failed hash pte insert with error -4
hash-mmu: mm: Hashing failure ! EA=0x7fff17000000 access=0x8000000000000006 current=daxio
hash-mmu: trap=0x300 vsid=0x22cb7a3 ssize=1 base psize=2 psize 10 pte=0xc000000501002b86
daxio[3860]: bus error (7) at 7fff17000000 nip 7fff973c007c lr 7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b0000+20000]
daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 f93f00a0 4800012c e93f0088 f93f0120
daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 <f9490000> e93f0088 39290008 f93f0110
The failure was due to guest kernel using wrong page size.
The namespaces created with 16M alignment will appear as below on a config with
16M page size disabled.
$ ndctl list -Ni
[
{
"dev":"namespace0.1",
"mode":"fsdax",
"map":"dev",
"size":5351931904,
"uuid":"fc6e9667-461a-4718-82b4-69b24570bddb",
"align":16777216,
"blockdev":"pmem0.1",
"supported_alignments":[
65536
]
},
{
"dev":"namespace0.0",
"mode":"fsdax", <==== devdax 16M alignment marked disabled.
"map":"mem",
"size":5368709120,
"uuid":"a4bdf81a-f2ee-4bc6-91db-7b87eddd0484",
"state":"disabled"
}
]
Cc: linux-mm@kvack.org
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/20190905154603.10349-8-aneesh.kumar@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
KVM needs to know if SMT is theoretically possible, meaning it is
supported and not forcefully disabled ('nosmt=force'). Create and
export cpu_smt_possible() answering this question.
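The predicate can be sketched as below, using a stand-in enum for the
SMT control state. The enum values and the sketch_* name are
illustrative, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the SMT control state. */
enum sketch_smt_control {
	SKETCH_SMT_ENABLED,
	SKETCH_SMT_DISABLED,		/* off, but can be turned back on */
	SKETCH_SMT_FORCE_DISABLED,	/* 'nosmt=force' */
	SKETCH_SMT_NOT_SUPPORTED,
};

/* SMT is possible unless it is unsupported or forcefully disabled. */
static bool sketch_cpu_smt_possible(enum sketch_smt_control ctl)
{
	return ctl != SKETCH_SMT_FORCE_DISABLED &&
	       ctl != SKETCH_SMT_NOT_SUPPORTED;
}
```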
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull backlight updates from Lee Jones:
"Core Frameworks
- Obtain scale type through sysfs
New Functionality:
- Provide Device Tree functionality in rave-sp-backlight
- Calculate if scale type is (non-)linear in pwm_bl
Fix-ups:
- Simplify code in lm3630a_bl
- Trivial rename/whitespace/typo fixes in lms283gf05
- Remove superfluous NULL check in tosa_lcd
- Fix power state initialisation in gpio_backlight
- List supported file in MAINTAINERS
Bug Fixes:
- Kconfig - default to not building unless requested in
{LED,BACKLIGHT}_CLASS_DEVICE"
* tag 'backlight-next-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
backlight: pwm_bl: Set scale type for brightness curves specified in the DT
backlight: pwm_bl: Set scale type for CIE 1931 curves
backlight: Expose brightness curve type through sysfs
MAINTAINERS: Add entry for stable backlight sysfs ABI documentation
backlight: gpio-backlight: Correct initial power state handling
video: backlight: tosa_lcd: drop check because i2c_unregister_device() is NULL safe
video: backlight: Drop default m for {LCD,BACKLIGHT_CLASS_DEVICE}
backlight: lms283gf05: Fix a typo in the description passed to 'devm_gpio_request_one()'
backlight: lm3630a: Switch to use fwnode_property_count_uXX()
backlight: rave-sp: Leave initial state and register with correct device
- Add mediatek support for MT7629 (Jianjun Wang)
* remotes/lorenzo/pci/mediatek:
PCI: mediatek: Add controller support for MT7629
dt-bindings: PCI: Add support for MT7629
- Convert pci_resource_to_user() to a weak function to remove
HAVE_ARCH_PCI_RESOURCE_TO_USER #defines (Denis Efremov)
- Use PCI_SRIOV_NUM_BARS for idiomatic loop structure (Denis Efremov)
- Fix Resizable BAR size suspend/restore for 1MB BARs (Sumit Saxena)
- Correct "pci=resource_alignment" example in documentation (Alexey
Kardashevskiy)
* pci/resource:
PCI: Correct pci=resource_alignment parameter example
PCI: Restore Resizable BAR size bits correctly for 1MB BARs
PCI: Use PCI_SRIOV_NUM_BARS in loops instead of PCI_IOV_RESOURCE_END
PCI: Convert pci_resource_to_user() to a weak function
# Conflicts:
# drivers/pci/pci.c
- Use devm_add_action_or_reset() helper (Fuqian Huang)
- Mark expected switch fall-through (Gustavo A. R. Silva)
- Convert sysfs device attributes from __ATTR() to DEVICE_ATTR() (Kelsey
Skunberg)
- Convert sysfs file permissions from S_IRUSR etc to octal (Kelsey
Skunberg)
- Move SR-IOV sysfs functions to iov.c (Kelsey Skunberg)
- Add pci_info_ratelimited() to ratelimit PCI messages separately
(Krzysztof Wilczynski)
- Fix "'static' not at beginning of declaration" warnings (Krzysztof
Wilczynski)
- Clean up resource_alignment parameter to not require static buffer
(Logan Gunthorpe)
- Add ACS quirk for iProc PAXB (Abhinav Ratna)
- Add pci_irq_vector() and other stubs for !CONFIG_PCI (Herbert Xu)
* pci/misc:
PCI: Add pci_irq_vector() and other stubs when !CONFIG_PCI
PCI: Add ACS quirk for iProc PAXB
PCI: Force trailing new line to resource_alignment_param in sysfs
PCI: Move pci_[get|set]_resource_alignment_param() into their callers
PCI: Clean up resource_alignment parameter to not require static buffer
PCI: Use static const struct, not const static struct
PCI: Add pci_info_ratelimited() to ratelimit PCI separately
PCI/IOV: Remove group write permission from sriov_numvfs, sriov_drivers_autoprobe
PCI/IOV: Move sysfs SR-IOV functions to iov.c
PCI: sysfs: Change permissions from symbolic to octal
PCI: sysfs: Change DEVICE_ATTR() to DEVICE_ATTR_WO()
PCI: sysfs: Define device attributes with DEVICE_ATTR*()
PCI: Mark expected switch fall-through
PCI: Use devm_add_action_or_reset()
- Consolidate _HPP & _HPX code in pci-acpi.h and remove unnecessary
struct hotplug_program_ops (Krzysztof Wilczynski)
- Fixup PCIe device types to remove the need for dev->has_secondary_link
(Mika Westerberg)
* pci/enumeration:
PCI: Get rid of dev->has_secondary_link flag
PCI: Make pcie_downstream_port() available outside of access.c
PCI/ACPI: Remove unnecessary struct hotplug_program_ops
PCI/ACPI: Move _HPP & _HPX functions to pci-acpi.c
PCI/ACPI: Rename _HPX structs from hpp_* to hpx_*
- Move many symbols from public linux/pci.h to subsystem-private
drivers/pci/pci.h (Kelsey Skunberg)
- Unexport pci_bus_get() and pci_bus_sem since they're not needed by
modules (Kelsey Skunberg)
- Remove unused pci_block_cfg_access() et al (Kelsey Skunberg)
* pci/encapsulate:
PCI: Make pci_set_of_node(), etc private
PCI: Make pci_enable_ptm() private
PCI: Make pcie_set_ecrc_checking(), pcie_ecrc_get_policy() private
PCI: Make pci_ats_init() private
PCI: Make pcie_update_link_speed() private
PCI: Make pci_bus_get(), pci_bus_put() private
PCI: Make pci_hotplug_io_size, mem_size, and bus_size private
PCI: Make pci_save_vc_state(), pci_restore_vc_state(), etc private
PCI: Make pci_get_host_bridge_device(), pci_put_host_bridge_device() private
PCI: Make pci_check_pme_status(), pci_pme_wakeup_bus() private
PCI: Make PCI_PM_* delay times private
PCI: Unexport pci_bus_sem
PCI: Unexport pci_bus_get() and pci_bus_put()
PCI: Remove pci_block_cfg_access() et al (unused)
Pull HID updates from Jiri Kosina:
- syzbot memory corruption fixes for hidraw, Prodikeys, Logitech and
Sony drivers from Alan Stern and Roderick Colenbrander
- stuck 'fn' key fix for hid-apple from Joao Moreno
- proper propagation of EPOLLOUT from hiddev and hidraw, from Fabian
Henneke
- fixes for handling power management for intel-ish devices with NO_D3
flag set, from Zhang Lixu
- extension of supported usage range for customer page, as some
Logitech devices are actually making use of it. From Olivier Gay.
- hid-multitouch is no longer filtering mice node creation, from
Benjamin Tissoires
- MobileStudio Pro 13 support, from Ping Cheng
- a few other device ID additions and assorted smaller fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (27 commits)
HID: core: fix dmesg flooding if report field larger than 32bit
HID: core: Add printk_once variants to hid_warn() etc
HID: core: reformat and reduce hid_printk macros
HID: prodikeys: Fix general protection fault during probe
HID: wacom: add new MobileStudio Pro 13 support
HID: sony: Fix memory corruption issue on cleanup.
HID: i2c-hid: modify quirks for weida's devices
HID: apple: Fix stuck function keys when using FN
HID: sb0540: add support for Creative SB0540 IR receivers
HID: Add quirk for HP X500 PIXART OEM mouse
HID: logitech-dj: Fix crash when initial logi_dj_recv_query_paired_devices fails
hid-logitech-dj: add the new Lightspeed receiver
HID: logitech-dj: add support of the G700(s) receiver
HID: multitouch: add support for the Smart Tech panel
HID: multitouch: do not filter mice nodes
HID: do not call hid_set_drvdata(hdev, NULL) in drivers
HID: wacom: do not call hid_set_drvdata(hdev, NULL)
HID: logitech: Fix general protection fault caused by Logitech driver
HID: hidraw: Fix invalid read in hidraw_ioctl
HID: wacom: support named keys on older devices
...
Pull selinux updates from Paul Moore:
- Add LSM hooks, and SELinux access control hooks, for dnotify,
fanotify, and inotify watches. This has been discussed with both the
LSM and fs/notify folks and everybody is good with these new hooks.
- The LSM stacking changes missed a few calls to current_security() in
the SELinux code; we fix those and remove current_security() for
good.
- Improve our network object labeling cache so that we always return
the object's label, even when under memory pressure. Previously we
would return an error if we couldn't allocate a new cache entry, now
we always return the label even if we can't create a new cache entry
for it.
- Convert the sidtab atomic_t counter to a normal u32 with
READ/WRITE_ONCE() and memory barrier protection.
- A few patches to policydb.c to clean things up (remove forward
declarations, long lines, bad variable names, etc)
* tag 'selinux-pr-20190917' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
lsm: remove current_security()
selinux: fix residual uses of current_security() for the SELinux blob
selinux: avoid atomic_t usage in sidtab
fanotify, inotify, dnotify, security: add security hook for fs notifications
selinux: always return a secid from the network caches if we find one
selinux: policydb - rename type_val_to_struct_array
selinux: policydb - fix some checkpatch.pl warnings
selinux: shuffle around policydb.c to get rid of forward declarations
It's an unusual configuration that was apparently never tested, and the
problem was not caught in linux-next because of a combination of travel
and the change making it into the tree too late.
The fix is to simply move the #define outside the CONFIG_MODULE
section, since MODULE_INFO() will do the right thing.
Cc: Martijn Coenen <maco@android.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Matthias Maennich <maennich@google.com>
Cc: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
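The shape of the fix described above can be sketched in plain C. The names below (`CONFIG_DEMO_MODULES`, `DEMO_INFO()`, `DEMO_WRAPPER()`) are hypothetical stand-ins and not the kernel's macros; the point is only that a wrapper defined outside the conditional section is available in both configurations, provided the macro it expands to is defined in both.

```c
/* CONFIG_DEMO_MODULES is a hypothetical stand-in for a config symbol. */
#ifdef CONFIG_DEMO_MODULES
/* "Modular" build: emit a tag=value string. */
#define DEMO_INFO(tag, val) #tag "=" val
#else
/* "Built-in" build: the macro still exists, but expands to an empty string. */
#define DEMO_INFO(tag, val) ""
#endif

/*
 * Defined outside the conditional section. Because DEMO_INFO() is
 * defined in both branches, this wrapper compiles either way.
 */
#define DEMO_WRAPPER(ns) DEMO_INFO(note, #ns)

static const char *demo_tag(void)
{
	return DEMO_WRAPPER(EXAMPLE);
}
```

Had `DEMO_WRAPPER()` been defined only inside the `#ifdef` branch, any user of it would fail to compile whenever `CONFIG_DEMO_MODULES` is not set.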
Pull soundwire updates from Vinod Koul:
"This includes DT support thanks to Srini and more work done by Intel
(Pierre) on improving cadence and intel support.
Summary:
- Add DT bindings and DT support in core
- Add debugfs support for soundwire properties
- Improvements on streaming handling to core
- Improved handling of Cadence module
- More updates and improvements to Intel driver"
* tag 'soundwire-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire: (30 commits)
soundwire: stream: make stream name a const pointer
soundwire: Add compute_params callback
soundwire: core: add device tree support for slave devices
dt-bindings: soundwire: add slave bindings
soundwire: bus: set initial value to port_status
soundwire: intel: handle disabled links
soundwire: intel: add debugfs register dump
soundwire: cadence_master: add debugfs register dump
soundwire: add debugfs support
soundwire: intel: remove unused variables
soundwire: intel: move shutdown() callback and don't export symbol
soundwire: cadence_master: add kernel parameter to override interrupt mask
soundwire: intel_init: add kernel module parameter to filter out links
soundwire: cadence_master: fix divider setting in clock register
soundwire: cadence_master: make use of mclk_freq property
soundwire: intel: read mclk_freq property from firmware
soundwire: add new mclk_freq field for properties
soundwire: stream: remove unnecessary variable initializations
soundwire: stream: fix disable sequence
soundwire: include mod_devicetable.h to avoid compiling warnings
...
Pull modules updates from Jessica Yu:
"The main bulk of this pull request introduces a new exported symbol
namespaces feature. The number of exported symbols keeps growing with
each release (we're at about 31k exports as of 5.3-rc7)
and we currently have no way of visualizing how these symbols are
"clustered" or making sense of this huge export surface.
Namespacing exported symbols allows kernel developers to more
explicitly partition and categorize exported symbols, and to more
easily limit the availability of namespaced symbols to other parts
of the kernel. For starters, we have introduced the USB_STORAGE
namespace to demonstrate the API's usage. I have briefly summarized
the feature and its main motivations in the tag below.
Summary:
- Introduce exported symbol namespaces.
This new feature allows subsystem maintainers to partition and
categorize their exported symbols into explicit namespaces. Module
authors are now required to import the namespaces they need.
Some of the main motivations of this feature include: allowing
kernel developers to better manage the export surface, allowing
subsystem maintainers to explicitly state that usage of some
exported symbols should be limited to certain users (think:
inter-module or inter-driver symbols, debugging symbols, etc), and
more easily limiting the availability of namespaced symbols
to other parts of the kernel.
With the module import requirement, it is also easier to spot the
misuse of exported symbols during patch review.
Two new macros are introduced: EXPORT_SYMBOL_NS() and
EXPORT_SYMBOL_NS_GPL(). The API is thoroughly documented in
Documentation/kbuild/namespaces.rst.
- Some small code and kbuild cleanups here and there"
* tag 'modules-for-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: Remove leftover '#undef' from export header
module: remove unneeded casts in cmp_name()
module: move CONFIG_UNUSED_SYMBOLS to the sub-menu of MODULES
module: remove redundant 'depends on MODULES'
module: Fix link failure due to invalid relocation on namespace offset
usb-storage: export symbols in USB_STORAGE namespace
usb-storage: remove single-use define for debugging
docs: Add documentation for Symbol Namespaces
scripts: Coccinelle script for namespace dependencies.
modpost: add support for generating namespace dependencies
export: allow definition default namespaces in Makefiles or sources
module: add config option MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS
modpost: add support for symbol namespaces
module: add support for symbol namespaces.
export: explicitly align struct kernel_symbol
module: support reading multiple values per modinfo tag
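The import requirement described in the modules pull message can be modeled in a few lines of userspace C. This is only a sketch of the rule, not the kernel's loader or modpost logic: `demo_symbol`, `demo_module`, `demo_has_import()` and `demo_may_use()` are hypothetical names. In the kernel itself, symbols are placed in a namespace with EXPORT_SYMBOL_NS() or EXPORT_SYMBOL_NS_GPL().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of an exported symbol; ns == NULL means no namespace. */
struct demo_symbol {
	const char *name;
	const char *ns;
};

/* Hypothetical model of a module and the namespaces it has imported. */
struct demo_module {
	const char *const *imports;
	size_t n_imports;
};

/* Returns 1 if the module lists ns among its imports, 0 otherwise. */
static int demo_has_import(const struct demo_module *m, const char *ns)
{
	for (size_t i = 0; i < m->n_imports; i++)
		if (strcmp(m->imports[i], ns) == 0)
			return 1;
	return 0;
}

/*
 * Returns 1 if the module may use the symbol: symbols without a
 * namespace are usable by anyone, namespaced symbols only by modules
 * that import the matching namespace.
 */
static int demo_may_use(const struct demo_module *m,
			const struct demo_symbol *s)
{
	if (!s->ns)
		return 1;
	return demo_has_import(m, s->ns);
}
```

In this model, a symbol tagged with the USB_STORAGE namespace is usable only by a module whose import list contains "USB_STORAGE", which is what makes misuse of such symbols easy to spot in review.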
Pull MIPS updates from Paul Burton:
"Main MIPS changes:
- boot_mem_map is removed, providing a nice cleanup made possible by
the recent removal of bootmem.
- Some fixes to atomics, in general providing compiler barriers for
smp_mb__{before,after}_atomic plus fixes specific to Loongson CPUs
or MIPS32 systems using cmpxchg64().
- Conversion to the new generic VDSO infrastructure courtesy of
Vincenzo Frascino.
- Removal of undefined behavior in set_io_port_base(), fixing the
behavior of some MIPS kernel configurations when built with recent
clang versions.
- Initial MIPS32 huge page support, functional on at least Ingenic
SoCs.
- pte_special() is now supported for some configurations, allowing
among other things generic fast GUP to be used.
- Miscellaneous fixes & cleanups.
And platform specific changes:
- Major improvements to Ingenic SoC support from Paul Cercueil,
mostly enabled by the inclusion of the new TCU (timer-counter unit)
drivers he's spent a very patient year or so working on. Plus some
fixes for X1000 SoCs from Zhou Yanjie.
- Netgear R6200 v1 systems are now supported by the bcm47xx platform.
- DT updates for BMIPS, Lantiq & Microsemi Ocelot systems"
* tag 'mips_5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (89 commits)
MIPS: Detect bad _PFN_SHIFT values
MIPS: Disable pte_special() for MIPS32 with RiXi
MIPS: ralink: deactivate PCI support for SOC_MT7621
mips: compat: vdso: Use legacy syscalls as fallback
MIPS: Drop Loongson _CACHE_* definitions
MIPS: tlbex: Remove cpu_has_local_ebase
MIPS: tlbex: Simplify r3k check
MIPS: Select R3k-style TLB in Kconfig
MIPS: PCI: refactor ioc3 special handling
mips: remove ioremap_cachable
mips/atomic: Fix smp_mb__{before,after}_atomic()
mips/atomic: Fix loongson_llsc_mb() wreckage
mips/atomic: Fix cmpxchg64 barriers
MIPS: Octeon: remove duplicated include from dma-octeon.c
firmware: bcm47xx_nvram: Allow COMPILE_TEST
firmware: bcm47xx_nvram: Correct size_t printf format
MIPS: Treat Loongson Extensions as ASEs
MIPS: Remove dev_err() usage after platform_get_irq()
MIPS: dts: mscc: describe the PTP ready interrupt
MIPS: dts: mscc: describe the PTP register range
...
Pull f2fs updates from Jaegeuk Kim:
"In this round, we introduced casefolding support in f2fs, and fixed
various bugs in individual features such as IO alignment,
checkpoint=disable, quota, and swapfile.
Enhancement:
- support casefolding w/ enhancement in ext4
- support fiemap for directory
- support FS_IO_GET|SET_FSLABEL
Bug fix:
- fix IO stuck during checkpoint=disable
- avoid infinite GC loop
- fix panic/overflow related to IO alignment feature
- fix livelock in swap file
- fix discard command leak
- disallow dio for atomic_write"
* tag 'f2fs-for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits)
f2fs: add a condition to detect overflow in f2fs_ioc_gc_range()
f2fs: fix to add missing F2FS_IO_ALIGNED() condition
f2fs: fix to fallback to buffered IO in IO aligned mode
f2fs: fix to handle error path correctly in f2fs_map_blocks
f2fs: fix extent corrupotion during directIO in LFS mode
f2fs: check all the data segments against all node ones
f2fs: Add a small clarification to CONFIG_FS_F2FS_FS_SECURITY
f2fs: fix inode rwsem regression
f2fs: fix to avoid accessing uninitialized field of inode page in is_alive()
f2fs: avoid infinite GC loop due to stale atomic files
f2fs: Fix indefinite loop in f2fs_gc()
f2fs: convert inline_data in prior to i_size_write
f2fs: fix error path of f2fs_convert_inline_page()
f2fs: add missing documents of reserve_root/resuid/resgid
f2fs: fix flushing node pages when checkpoint is disabled
f2fs: enhance f2fs_is_checkpoint_ready()'s readability
f2fs: clean up __bio_alloc()'s parameter
f2fs: fix wrong error injection path in inc_valid_block_count()
f2fs: fix to writeout dirty inode during node flush
f2fs: optimize case-insensitive lookups
...
Pull ext2, quota, udf fixes and cleanups from Jan Kara:
- two small quota fixes (in grace time handling and possible missed
accounting of preallocated blocks beyond EOF).
- some ext2 cleanups
- udf fixes for better compatibility with Windows 10 generated media
(named streams, write-protection using domain-identifier, placement
of volume recognition sequence)
- some udf cleanups
* tag 'for_v5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: fix wrong condition in is_quota_modification()
fs-udf: Delete an unnecessary check before brelse()
ext2: Delete an unnecessary check before brelse()
udf: Drop forward function declarations
udf: Verify domain identifier fields
udf: augment UDF permissions on new inodes
udf: Use dynamic debug infrastructure
udf: reduce leakage of blocks related to named streams
udf: prevent allocation beyond UDF partition
quota: fix condition for resetting time limit in do_set_dqblk()
ext2: code cleanup for ext2_free_blocks()
ext2: fix block range in ext2_data_block_valid()
udf: support 2048-byte spacing of VRS descriptors on 4K media
udf: refactor VRS descriptor identification
Pull MTD updates from Richard Weinberger:
"MTD core changes:
- add debugfs nodes for querying the flash name and id
- mtd parser reorganization
SPI NOR core changes:
- always use bounce buffer for register read/writes
- move m25p80 code in spi-nor.c
- rework hwcaps selection for the spi-mem case
- rework the core in order to move the manufacturer specific code out
of it:
- regroup flash parameters in 'struct spi_nor_flash_parameter'
- add default_init() and post_sfdp() hooks to tweak the flash
parameters
- introduce the ->set_4byte(), ->convert_addr() and ->setup()
methods, to deal with manufacturer specific code
- rework the SPI NOR lock/unlock logic
- fix an error code in spi_nor_read_raw()
- fix a memory leak bug
- enable the debugfs for the partname and partid
- add support for a few flashes
SPI NOR controller drivers changes:
- intel-spi:
- Whitelist 4B read commands
- Add support for Intel Tiger Lake SPI serial flash
- aspeed-smc: Add of_node_put()
- hisi-sfc: add of_node_put()
- cadence-quadspi: Fix QSPI RCU Schedule Stall
NAND core:
- Fixing typos
- Adding missing of_node_put() in various drivers
Raw NAND controller drivers:
- Macronix: new controller driver
- Omap2: fix the number of bitflips returned
- Brcmnand: fix a pointer not iterating over all the page chunks
- W90x900: driver removed
- Onenand: fix a memory leak
- Sharpsl: missing include guard
- STM32: avoid warnings when building with W=1
- Ingenic: fix a coccinelle warning
- r852: call a helper to simplify the code
CFI core:
- Kill useless initializer in mtd_do_chip_probe()
- Fix a rare write failure seen on some cfi_cmdset_0002 compliant
Parallel NORs
- Bunch of cleanups for cfi_cmdset_0002 driver's write functions by
Tokunori Ikegami"
* tag 'mtd/for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (77 commits)
mtd: pmc551: Remove set but not used variable 'soff_lo'
mtd: cfi_cmdset_0002: Fix do_erase_chip() to get chip as erasing mode
mtd: sm_ftl: Fix memory leak in sm_init_zone() error path
mtd: parsers: Move CMDLINE parser
mtd: parsers: Move OF parser
mtd: parsers: Move BCM63xx parser
mtd: parsers: Move BCM47xx parser
mtd: parsers: Move TI AR7 parser
mtd: pismo: Simplify getting the adapter of a client
mtd: phram: Module parameters add writable permissions
mtd: pxa2xx: Use ioremap_cache insted of ioremap_cached
mtd: spi-nor: Rename "n25q512a" to "mt25qu512a (n25q512a)"
mtd: spi-nor: Add support for mt35xu02g
mtd: rawnand: omap2: Fix number of bitflips reporting with ELM
mtd: rawnand: brcmnand: Fix ecc chunk calculation for erased page bitfips
mtd: spi-nor: remove superfluous pass of nor->info->sector_size
mtd: spi-nor: enable the debugfs for the partname and partid
mtd: mtdcore: add debugfs nodes for querying the flash name and id
mtd: spi-nor: hisi-sfc: Add of_node_put() before break
mtd: spi-nor: aspeed-smc: Add of_node_put()
...