Architectures like arm64 have HugeTLB page sizes which are different
than generic sizes at PMD, PUD, PGD level and implemented via contiguous
bits. At present these special size HugeTLB pages cannot be identified
through macros like (PMD|PUD|PGDIR)_SHIFT and hence chosen not be
migrated.
Enabling migration support for these special HugeTLB page sizes along
with the generic ones (PMD|PUD|PGD) would require identifying all of
them on a given platform. A platform specific hook can precisely
enumerate all huge page sizes supported for migration. Instead of
comparing against standard huge page orders let
hugetlb_migration_support() function call a platform hook
arch_hugetlb_migration_support(). Default definition for the platform
hook maintains existing semantics which checks standard huge page order.
But an architecture can choose to override the default and provide
support for a comprehensive set of huge page sizes.
Link: http://lkml.kernel.org/r/1545121450-1663-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "arm64/mm: Enable HugeTLB migration", v4.
This patch series enables HugeTLB migration support for all supported
huge page sizes at all levels including contiguous bit implementation.
Following HugeTLB migration support matrix has been enabled with this
patch series. All permutations have been tested except for the 16GB.
CONT PTE PMD CONT PMD PUD
-------- --- -------- ---
4K: 64K 2M 32M 1G
16K: 2M 32M 1G
64K: 2M 512M 16G
First the series adds migration support for PUD based huge pages. It
then adds a platform specific hook to query an architecture if a given
huge page size is supported for migration while also providing a default
fallback option preserving the existing semantics which just checks for
(PMD|PUD|PGDIR)_SHIFT macros. The last two patches enables HugeTLB
migration on arm64 and subscribe to this new platform specific hook by
defining an override.
The second patch differentiates between movability and migratability
aspects of huge pages and implements hugepage_movable_supported() which
can then be used during allocation to decide whether to place the huge
page in movable zone or not.
This patch (of 5):
During huge page allocation it's migratability is checked to determine
if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
But the movability aspect of the huge page could depend on other factors
than just migratability. Movability in itself is a distinct property
which should not be tied with migratability alone.
This differentiates these two and implements an enhanced movability check
which also considers huge page size to determine if it is feasible to be
placed under a movable zone. At present it just checks for gigantic pages
but going forward it can incorporate other enhanced checks.
Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a new kernel module for analysis of vmalloc allocator. It is
only enabled as a module. There are two main reasons this module should
be used for: performance evaluation and stressing of vmalloc subsystem.
It consists of several test cases. As of now there are 8. The module
has five parameters we can specify to change its the behaviour.
1) run_test_mask - set of tests to be run
id: 1, name: fix_size_alloc_test
id: 2, name: full_fit_alloc_test
id: 4, name: long_busy_list_alloc_test
id: 8, name: random_size_alloc_test
id: 16, name: fix_align_alloc_test
id: 32, name: random_size_align_alloc_test
id: 64, name: align_shift_alloc_test
id: 128, name: pcpu_alloc_test
By default all tests are in run test mask. If you want to select some
specific tests it is possible to pass the mask. For example for first,
second and fourth tests we go 11 value.
2) test_repeat_count - how many times each test should be repeated
By default it is one time per test. It is possible to pass any number.
As high the value is the test duration gets increased.
3) test_loop_count - internal test loop counter. By default it is set
to 1000000.
4) single_cpu_test - use one CPU to run the tests
By default this parameter is set to false. It means that all online
CPUs execute tests. By setting it to 1, the tests are executed by
first online CPU only.
5) sequential_test_order - run tests in sequential order
By default this parameter is set to false. It means that before running
tests the order is shuffled. It is possible to make it sequential, just
set it to 1.
Performance analysis:
In order to evaluate performance of vmalloc allocations, usually it
makes sense to use only one CPU that runs tests, use sequential order,
number of repeat tests can be different as well as set of test mask.
For example if we want to run all tests, to use one CPU and repeat each
test 3 times. Insert the module passing following parameters:
single_cpu_test=1 sequential_test_order=1 test_repeat_count=3
with following output:
<snip>
Summary: fix_size_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 901177 usec
Summary: full_fit_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 1039341 usec
Summary: long_busy_list_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 11775763 usec
Summary: random_size_alloc_test passed 3: failed: 0 repeat: 3 loops: 1000000 avg: 6081992 usec
Summary: fix_align_alloc_test passed: 3 failed: 0 repeat: 3, loops: 1000000 avg: 2003712 usec
Summary: random_size_align_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 2895689 usec
Summary: align_shift_alloc_test passed: 0 failed: 3 repeat: 3 loops: 1000000 avg: 573 usec
Summary: pcpu_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 95802 usec
All test took CPU0=192945605995 cycles
<snip>
The align_shift_alloc_test is expected to be failed.
Stressing:
In order to stress the vmalloc subsystem we run all available test cases
on all available CPUs simultaneously. In order to prevent constant behaviour
pattern, the test cases array is shuffled by default to randomize the order
of test execution.
For example if we want to run all tests(default), use all online CPUs(default)
with shuffled order(default) and to repeat each test 30 times. The command
would be like:
modprobe vmalloc_test test_repeat_count=30
Expected results are the system is alive, there are no any BUG_ONs or Kernel
Panics the tests are completed, no memory leaks.
[urezki@gmail.com: fix 32-bit builds]
Link: http://lkml.kernel.org/r/20190106214839.ffvjvmrn52uqog7k@pc636
[urezki@gmail.com: make CONFIG_TEST_VMALLOC depend on CONFIG_MMU]
Link: http://lkml.kernel.org/r/20190219085441.s6bg2gpy4esny5vw@pc636
Link: http://lkml.kernel.org/r/20190103142108.20744-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When VM_NO_GUARD is not set area->size includes adjacent guard page,
thus for correct size checking get_vm_area_size() should be used, but
not area->size.
This fixes possible kernel oops when userspace tries to mmap an area on
1 page bigger than was allocated by vmalloc_user() call: the size check
inside remap_vmalloc_range_partial() accounts non-existing guard page
also, so check successfully passes but vmalloc_to_page() returns NULL
(guard page does not physically exist).
The following code pattern example should trigger an oops:
static int oops_mmap(struct file *file, struct vm_area_struct *vma)
{
void *mem;
mem = vmalloc_user(4096);
BUG_ON(!mem);
/* Do not care about mem leak */
return remap_vmalloc_range(vma, mem, 0);
}
And userspace simply mmaps size + PAGE_SIZE:
mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0);
Possible candidates for oops which do not have any explicit size
checks:
*** drivers/media/usb/stkwebcam/stk-webcam.c:
v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0);
Or the following one:
*** drivers/video/fbdev/core/fbmem.c
static int
fb_mmap(struct file *file, struct vm_area_struct * vma)
...
res = fb->fb_mmap(info, vma);
Where fb_mmap callback calls remap_vmalloc_range() directly without any
explicit checks:
*** drivers/video/fbdev/vfb.c
static int vfb_mmap(struct fb_info *info,
struct vm_area_struct *vma)
{
return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff);
}
Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch repeats the original one from David S Miller:
2dca6999ee ("mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA")
but for missed vmalloc_32_user() case, which also requires correct
alignment of virtual address on kernel side to avoid D-caches aliases.
A bit of copy-paste from original patch to recover in memory of what is
all about:
When a vmalloc'd area is mmap'd into userspace, some kind of
co-ordination is necessary for this to work on platforms with cpu
D-caches which can have aliases.
Otherwise kernel side writes won't be seen properly in userspace and
vice versa.
If the kernel side mapping and the user side one have the same
alignment, modulo SHMLBA, this can work as long as VM_SHARED is shared
of VMA and for all current users this is true. VM_SHARED will force
SHMLBA alignment of the user side mmap on platforms with D-cache
aliasing matters.
David S. Miller
> What are the user-visible runtime effects of this change?
In simple words: proper alignment avoids possible difference in data,
seen by different virtual mapings: userspace and kernel in our case.
I.e. userspace reads cache line A, kernel writes to cache line B. Both
cache lines correspond to the same physical memory (thus aliases).
So this should fix data corruption for archs with vivt and vipt caches,
e.g. armv6. Personally I've never worked with this archs, I just
spotted the strange difference in code: for one case we do alignment,
for another - not. I have a strong feeling that David simply missed
vmalloc_32_user() case.
>
> Is a -stable backport needed?
No, I do not think so. The only one user of vmalloc_32_user() is
virtual frame buffer device drivers/video/fbdev/vfb.c, which has in the
description "The main use of this frame buffer device is testing and
debugging the frame buffer subsystem. Do NOT enable it for normal
systems!".
And it seems to me that this vfb.c does not need 32bit addressable pages
(vmalloc_32_user() case), because it is virtual device and should not
care about things like dma32 zones, etc. Probably is better to clean
the code and switch vfb.c from vmalloc_32_user() to vmalloc_user() case
and wipe out vmalloc_32_user() from vmalloc.c completely. But I'm not
very much sure that this is worth to do, that's so minor, so we can
leave it as is.
Link: http://lkml.kernel.org/r/20190108110944.23591-1-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an optimization for KSM pages almost in the same way that we have
for ordinary anonymous pages. If there is a write fault in a page,
which is mapped to an only pte, and it is not related to swap cache; the
page may be reused without copying its content.
[ Note that we do not consider PageSwapCache() pages at least for now,
since we don't want to complicate __get_ksm_page(), which has nice
optimization based on this (for the migration case). Currenly it is
spinning on PageSwapCache() pages, waiting for when they have
unfreezed counters (i.e., for the migration finish). But we don't want
to make it also spinning on swap cache pages, which we try to reuse,
since there is not a very high probability to reuse them. So, for now
we do not consider PageSwapCache() pages at all. ]
So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
page_stable_node(), to skip a page, which KSM is currently trying to
link to stable tree. Then we do page_ref_freeze() to prohibit KSM to
merge one more page into the page, we are reusing. After that, nobody
can refer to the reusing page: KSM skips !PageSwapCache() pages with
zero refcount; and the protection against of all other participants is
the same as for reused ordinary anon pages pte lock, page lock and
mmap_sem.
[akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
All these places for replacement were found by running the following
grep patterns on the entire kernel code. Please let me know if this
might have missed some instances. This might also have replaced some
false positives. I will appreciate suggestions, inputs and review.
1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"
This patch (of 2):
At present there are multiple places where invalid node number is
encoded as -1. Even though implicitly understood it is always better to
have macros in there. Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE. This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.
Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/kdump: allow to exclude pages that are logically
offline"
Right now, pages inflated as part of a balloon driver will be dumped by
dump tools like makedumpfile. While XEN is able to check in the crash
kernel whether a certain pfn is actuall backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
virtio-balloon, hv-balloon and VMWare balloon inflated memory will
essentially result in zero pages getting allocated by the hypervisor and
the dump getting filled with this data.
The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not
to be dumped.
Also for XEN, calling into the kernel and asking the hypervisor if a pfn
is backed can be avoided if the duming tool would skip such pages right
from the beginning.
Dumping tools have no idea whether a given page is part of a balloon
driver and shall not be dumped. Esp. PG_reserved cannot be used for
that purpose as all memory allocated during early boot is also
PG_reserved, see discussion at [1]. So some other way of indication is
required and a new page flag is frowned upon.
We have PG_balloon (MAPCOUNT value), which is essentially unused now. I
suggest renaming it to something more generic (PG_offline) to mark pages
as logically offline. This flag can than e.g. also be used by
virtio-mem in the future to mark subsections as offline. Or by other
code that wants to put pages logically offline (e.g. later maybe
poisoned pages that shall no longer be used).
This series converts PG_balloon to PG_offline, allows dumping tools to
query the value to detect such pages and marks pages in the hv-balloon
and XEN balloon properly as PG_offline. Note that virtio-balloon
already set pages to PG_balloon (and now PG_offline).
Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some condicions
result in a kernel panic when dumping them.
As I don't have access to neither XEN nor Hyper-V nor VMWare
installations, this was only tested with the virtio-balloon and pages
were properly skipped when dumping. I'll also attach the makedumpfile
patch to this series.
[1] https://lkml.org/lkml/2018/7/20/566
This patch (of 8):
Commit b1123ea6d3 ("mm: balloon: use general non-lru movable page
feature") reworked balloon handling to make use of the general non-lru
movable page feature. The big comment block in balloon_compaction.h
contains quite some outdated information. Let's fix this.
Link: http://lkml.kernel.org/r/20181119101616.8901-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Christian Hansen <chansen3@cisco.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: Julien Freche <jfreche@vmware.com>
Cc: Kairui Song <kasong@redhat.com>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Lianbo Jiang <lijiang@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Pankaj gupta <pagupta@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmemleak throws endless warnings during boot due to in
__alloc_alien_cache(),
alc = kmalloc_node(memsize, gfp, node);
init_arraycache(&alc->ac, entries, batch);
kmemleak_no_scan(ac);
Kmemleak does not track the array cache (alc->ac) but the alien cache
(alc) instead, so let it track the latter by lifting kmemleak_no_scan()
out of init_arraycache().
There is another place that calls init_arraycache(), but
alloc_kmem_cache_cpus() uses the percpu allocation where will never be
considered as a leak.
kmemleak: Found object by alias at 0xffff8007b9aa7e38
CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
Call trace:
dump_backtrace+0x0/0x168
show_stack+0x24/0x30
dump_stack+0x88/0xb0
lookup_object+0x84/0xac
find_and_get_object+0x84/0xe4
kmemleak_no_scan+0x74/0xf4
setup_kmem_cache_node+0x2b4/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
ret_from_fork+0x10/0x18
kmemleak: Object 0xffff8007b9aa7e00 (size 256):
kmemleak: comm "swapper/0", pid 1, jiffies 4294697137
kmemleak: min_count = 1
kmemleak: count = 0
kmemleak: flags = 0x1
kmemleak: checksum = 0
kmemleak: backtrace:
kmemleak_alloc+0x84/0xb8
kmem_cache_alloc_node_trace+0x31c/0x3a0
__kmalloc_node+0x58/0x78
setup_kmem_cache_node+0x26c/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
Call trace:
dump_backtrace+0x0/0x168
show_stack+0x24/0x30
dump_stack+0x88/0xb0
kmemleak_no_scan+0x90/0xf4
setup_kmem_cache_node+0x2b4/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
ret_from_fork+0x10/0x18
Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
Fixes: 1fe00d50a9 ("slab: factor out initialization of array cache")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the code to use a zero-sized array instead of a pointer in
structure ocfs2_slot_info and use struct_size() in kzalloc().
Notice that one of the more common cases of allocation size calculations
is finding the size of a structure that has a zero-sized array at the
end, along with memory for some number of elements for that array. For
example:
struct foo {
int stuff;
void *entry[];
};
instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20190108191903.GA22056@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The user reported this problem, the upper application IO was timeout
when fstrim was running on this ocfs2 partition. the application
monitoring resource agent considered that this application did not work,
then this node was fenced by the cluster brain (e.g. pacemaker).
The root cause is that fstrim thread always holds main_bm meta-file
related locks until all the cluster groups are trimmed. This patch will
make fstrim thread release main_bm meta-file related locks when each
cluster group is trimmed, this will let the current application IO has a
chance to claim the clusters from main_bm meta-file.
Link: http://lkml.kernel.org/r/20190111090014.31645-1-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the process of creating a node, it will cause NULL pointer
dereference in kernel if o2cb_ctl failed in the interval (mkdir,
o2cb_set_node_attribute(node_num)] in function o2cb_add_node.
The node num is initialized to 0 in function o2nm_node_group_make_item,
o2nm_node_group_drop_item will mistake the node number 0 for a valid
node number when we delete the node before the node number is set
correctly. If the local node number of the current host happens to be
0, cluster->cl_local_node will be set to O2NM_INVALID_NODE_NUM while
o2hb_thread still running. The panic stack is generated as follows:
o2hb_thread
\-o2hb_do_disk_heartbeat
\-o2hb_check_own_slot
|-slot = ®->hr_slots[o2nm_this_node()];
//o2nm_this_node() return O2NM_INVALID_NODE_NUM
We need to check whether the node number is set when we delete the node.
Link: http://lkml.kernel.org/r/133d8045-72cc-863e-8eae-5013f9f6bc51@huawei.com
Signed-off-by: Jia Guo <guojia12@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Building little-endian allmodconfig kernels on arm64 started failing
with the generated atomic.h implementation, since we now try to call
kasan helpers from the EFI stub:
aarch64-linux-gnu-ld: drivers/firmware/efi/libstub/arm-stub.stub.o: in function `atomic_set':
include/generated/atomic-instrumented.h:44: undefined reference to `__efistub_kasan_check_write'
I suspect that we get similar problems in other files that explicitly
disable KASAN for some reason but call atomic_t based helper functions.
We can fix this by checking the predefined __SANITIZE_ADDRESS__ macro
that the compiler sets instead of checking CONFIG_KASAN, but this in
turn requires a small hack in mm/kasan/common.c so we do see the extern
declaration there instead of the inline function.
Link: http://lkml.kernel.org/r/20181211133453.2835077-1-arnd@arndb.de
Fixes: b1864b828644 ("locking/atomics: build atomic headers as required")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
It triggers false positives in the allocation path:
BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
Read of size 8 at addr ffff88881f800000 by task swapper/0
CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
Call Trace:
dump_stack+0xe0/0x19a
print_address_description.cold.2+0x9/0x28b
kasan_report.cold.3+0x7a/0xb5
__asan_report_load8_noabort+0x19/0x20
memchr_inv+0x2ea/0x330
kernel_poison_pages+0x103/0x3d5
get_page_from_freelist+0x15e7/0x4d90
because KASAN has not yet unpoisoned the shadow page for allocation
before it checks memchr_inv() but only found a stale poison pattern.
Also, false positives in free path,
BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
Call Trace:
dump_stack+0xe0/0x19a
print_address_description.cold.2+0x9/0x28b
kasan_report.cold.3+0x7a/0xb5
check_memory_region+0x22d/0x250
memset+0x28/0x40
kernel_poison_pages+0x29e/0x3d5
__free_pages_ok+0x75f/0x13e0
due to KASAN adds poisoned redzones around slab objects, but the page
poisoning needs to poison the whole page.
Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use after scope bugs detector seems to be almost entirely useless for
the linux kernel. It exists over two years, but I've seen only one
valid bug so far [1]. And the bug was fixed before it has been
reported. There were some other use-after-scope reports, but they were
false-positives due to different reasons like incompatibility with
structleak plugin.
This feature significantly increases stack usage, especially with GCC <
9 version, and causes a 32K stack overflow. It probably adds
performance penalty too.
Given all that, let's remove use-after-scope detector entirely.
While preparing this patch I've noticed that we mistakenly enable
use-after-scope detection for clang compiler regardless of
CONFIG_KASAN_EXTRA setting. This is also fixed now.
[1] http://lkml.kernel.org/r/<20171129052106.rhgbjhhis53hkgfn@wfg-t540p.sh.intel.com>
Link: http://lkml.kernel.org/r/20190111185842.13978-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Will Deacon <will.deacon@arm.com> [arm64]
Cc: Qian Cai <cai@lca.pw>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When soft_offline_in_use_page() runs on a thp tail page after pmd is
split, we trigger the following VM_BUG_ON_PAGE():
Memory failure: 0x3755ff: non anonymous thp
__get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000
Soft offlining pfn 0x34d805 at process virtual address 0x20fff000
page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1
flags: 0x2fffff80000000()
raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
------------[ cut here ]------------
kernel BUG at ./include/linux/mm.h:519!
soft_offline_in_use_page() passed refcount and page lock from tail page
to head page, which is not needed because we can pass any subpage to
split_huge_page().
Naoya had fixed a similar issue in c3901e722b ("mm: hwpoison: fix thp
split handling in memory_failure()"). But he missed fixing soft
offline.
Link: http://lkml.kernel.org/r/1551452476-24000-1-git-send-email-zhongjiang@huawei.com
Fixes: 61f5d698cc ("mm: re-enable THP")
Signed-off-by: zhongjiang <zhongjiang@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we met this once, let fsck.f2fs clear this only.
Note that, this addresses all the subtle fault injection test.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Dan reported:
"We put an upper bound on ->write_io_size_bits but we don't have a lower
bound."
So let's add lower bound check for ->write_io_size_bits in parse_options().
[We don't allow configuring ->write_io_size_bits to zero, since at least
we need to fill one dummy page for aligned IO.]
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We use below condition to check inline_xattr_size boundary:
if (!F2FS_OPTION(sbi).inline_xattr_size ||
F2FS_OPTION(sbi).inline_xattr_size >=
DEF_ADDRS_PER_INODE -
F2FS_TOTAL_EXTRA_ATTR_SIZE -
DEF_INLINE_RESERVED_SIZE -
DEF_MIN_INLINE_SIZE)
There is there problems in that check:
- we should allow inline_xattr_size equaling to min size of inline
{data,dentry} area.
- F2FS_TOTAL_EXTRA_ATTR_SIZE and inline_xattr_size are based on
different size unit, previous one is 4 bytes, latter one is 1 bytes.
- DEF_MIN_INLINE_SIZE only indicate min size of inline data area,
however, we need to consider min size of inline dentry area as well,
minimal inline dentry should at least contain two entries: '.' and
'..', so that min inline_dentry size is 40 bytes.
.bitmap 1 * 1 = 1
.reserved 1 * 1 = 1
.dentry 11 * 2 = 22
.filename 8 * 2 = 16
total 40
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix below warning coming because of using mutex lock in atomic context.
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:98
in_atomic(): 1, irqs_disabled(): 0, pid: 585, name: sh
Preemption disabled at: __radix_tree_preload+0x28/0x130
Call trace:
dump_backtrace+0x0/0x2b4
show_stack+0x20/0x28
dump_stack+0xa8/0xe0
___might_sleep+0x144/0x194
__might_sleep+0x58/0x8c
mutex_lock+0x2c/0x48
f2fs_trace_pid+0x88/0x14c
f2fs_set_node_page_dirty+0xd0/0x184
Do not use f2fs_radix_tree_insert() to avoid doing cond_resched() with
spin_lock() acquired.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we changed lock from cp_rwsem to node_change, it solved
the deadlock issue which was caused by below race condition:
Thread A Thread B
- f2fs_setattr
- f2fs_lock_op -- read_lock
- dquot_transfer
- __dquot_transfer
- dquot_acquire
- commit_dqblk
- f2fs_quota_write
- f2fs_write_begin
- f2fs_write_failed
- write_checkpoint
- block_operations
- f2fs_lock_all -- write_lock
- f2fs_truncate_blocks
- f2fs_lock_op -- read_lock
But it breaks the sematics of cp_rwsem, in other callers like:
- f2fs_file_write_iter -> f2fs_write_begin -> f2fs_write_failed
- f2fs_direct_IO -> f2fs_write_failed
We allow to truncate dnode w/o cp_rwsem held, result in incorrect sit
bitmap update, which can cause further data corruption.
So this patch reverts previous fix implementation, and try to fix
deadlock by skipping calling f2fs_truncate_blocks() in f2fs_write_failed()
only for quota file, and keep the preallocated data/node in the tail of
quota file, we can expecte that the preallocated space can be used to
store quota info latter soon.
Fixes: af033b2aa8 ("f2fs: guarantee journalled quota data by checkpoint")
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When sb->s_root is NULL dput() will do nothing,
so jump to label 'free_node_inode' instead of lable
'free_root_inode' when failing from d_make_root().
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>