While running a compile on arm64, I hit a memory exposure
usercopy: kernel memory exposure attempt detected from fffffc0000f3b1a8 (buffer_head) (1 bytes)
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:75!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in: ip6t_rpfilter ip6t_REJECT
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp
llc ebtable_nat ip6table_security ip6table_raw ip6table_nat
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
iptable_security iptable_raw iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
ebtable_filter ebtables ip6table_filter ip6_tables vfat fat xgene_edac
xgene_enet edac_core i2c_xgene_slimpro i2c_core at803x realtek xgene_dma
mdio_xgene gpio_dwapb gpio_xgene_sb xgene_rng mailbox_xgene_slimpro nfsd
auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sdhci_of_arasan
sdhci_pltfm sdhci mmc_core xhci_plat_hcd gpio_keys
CPU: 0 PID: 19744 Comm: updatedb Tainted: G W 4.8.0-rc3-threadinfo+ #1
Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.12 Aug 12 2016
task: fffffe03df944c00 task.stack: fffffe00d128c000
PC is at __check_object_size+0x70/0x3f0
LR is at __check_object_size+0x70/0x3f0
...
[<fffffc00082b4280>] __check_object_size+0x70/0x3f0
[<fffffc00082cdc30>] filldir64+0x158/0x1a0
[<fffffc0000f327e8>] __fat_readdir+0x4a0/0x558 [fat]
[<fffffc0000f328d4>] fat_readdir+0x34/0x40 [fat]
[<fffffc00082cd8f8>] iterate_dir+0x190/0x1e0
[<fffffc00082cde58>] SyS_getdents64+0x88/0x120
[<fffffc0008082c70>] el0_svc_naked+0x24/0x28
fffffc0000f3b1a8 is a module address. Modules may have compiled in
strings which could get copied to userspace. In this instance, it
looks like "." which matches with a size of 1 byte. Extend the
is_vmalloc_addr check to be is_vmalloc_or_module_addr to cover
all possible cases.
Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
During cgroup2 rollout into production, we started encountering css
refcount underflows and css access crashes in the memory controller.
Splitting the heavily shared css reference counter into logical users
narrowed the imbalance down to the cgroup2 socket memory accounting.
The problem turns out to be the per-cpu charge cache. Cgroup1 had a
separate socket counter, but the new cgroup2 socket accounting goes
through the common charge path that uses a shared per-cpu cache for all
memory that is being tracked. Those caches are safe against scheduling
preemption, but not against interrupts - such as the newly added packet
receive path. When cache draining is interrupted by network RX taking
pages out of the cache, the resuming drain operation will put references
of in-use pages, thus causing the imbalance.
Disable IRQs during all per-cpu charge cache operations.
Fixes: f7e1cb6ec5 ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
Link: http://lkml.kernel.org/r/20160914194846.11153-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 62c230bc17 ("mm: add support for a filesystem to activate
swap files and use direct_IO for writing swap pages") replaced the
swap_aops dirty hook from __set_page_dirty_no_writeback() with
swap_set_page_dirty().
For normal cases without these special SWP flags code path falls back to
__set_page_dirty_no_writeback() so the behaviour is expected to be the
same as before.
But swap_set_page_dirty() makes use of the page_swap_info() helper to
get the swap_info_struct to check for the flags like SWP_FILE,
SWP_BLKDEV etc as desired for those features. This helper has
BUG_ON(!PageSwapCache(page)) which is racy and safe only for the
set_page_dirty_lock() path.
For the set_page_dirty() path which is often needed for cases to be
called from irq context, kswapd() can toggle the flag behind the back
while the call is getting executed when system is low on memory and
heavy swapping is ongoing.
This ends up with undesired kernel panic.
This patch just moves the check outside the helper to its users
appropriately to fix kernel panic for the described path. Couple of
users of helpers already take care of SwapCache condition so I skipped
them.
Link: http://lkml.kernel.org/r/1473460718-31013-1-git-send-email-santosh.shilimkar@oracle.com
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joe Perches <joe@perches.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rik van Riel <riel@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jens Axboe <axboe@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> [4.7.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dump_page() uses page_mapcount() to get mapcount of the page.
page_mapcount() has VM_BUG_ON_PAGE(PageSlab(page)) as mapcount doesn't
make sense for slab pages and the field in struct page used for other
information.
It leads to recursion if dump_page() called for slub page and DEBUG_VM
is enabled:
dump_page() -> page_mapcount() -> VM_BUG_ON_PAGE() -> dump_page -> ...
Let's avoid calling page_mapcount() for slab pages in dump_page().
Link: http://lkml.kernel.org/r/20160908082137.131076-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 394e31d2ce ("mem-hotplug: alloc new page from a nearest
neighbor node when mem-offline") introduced new_node_page() for memory
hotplug.
In new_node_page(), the nid is cleared before calling
__alloc_pages_nodemask(). But if it is the only node of the system, and
the first round allocation fails, it will not be able to get memory from
an empty nodemask, and will trigger oom.
The patch checks whether it is the last node on the system, and if it
is, then don't clear the nid in the nodemask.
Fixes: 394e31d2ce ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reported-by: John Allen <jallen@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull libnvdimm fixes from Dan Williams:
"nvdimm fixes for v4.8, two of them are tagged for -stable:
- Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
DAX pmd mappings end up with an uncached pgprot, and unusable
performance for the device-dax interface. The device-dax interface
appeared in 4.7 so this is tagged for -stable.
- Fix a couple VM_BUG_ON() checks in the show_smaps() path to
understand DAX pmd entries. This fix is tagged for -stable.
- Fix a mis-merge of the nfit machine-check handler to flip the
polarity of an if() to match the final version of the patch that
Vishal sent for 4.8-rc1. Without this the nfit machine check
handler never detects / inserts new 'badblocks' entries which
applications use to identify lost portions of files.
- For test purposes, fix the nvdimm_clear_poison() path to operate on
legacy / simulated nvdimm memory ranges. Without this fix a test
can set badblocks, but never clear them on these ranges.
- Fix the range checking done by dax_dev_pmd_fault(). This is not
tagged for -stable since this problem is mitigated by specifying
aligned resources at device-dax setup time.
These patches have appeared in a next release over the past week. The
recent rebase you can see in the timestamps was to drop an invalid fix
as identified by the updated device-dax unit tests [1]. The -mm
touches have an ack from Andrew"
[1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm: allow legacy (e820) pmem region to clear bad blocks
nfit, mce: Fix SPA matching logic in MCE handler
mm: fix cache mode of dax pmd mappings
mm: fix show_smap() for zone_device-pmd ranges
dax: fix mapping size check
Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
currently results in the following VM_BUG_ONs:
kernel BUG at mm/huge_memory.c:1105!
task: ffff88045f16b140 task.stack: ffff88045be14000
RIP: 0010:[<ffffffff81268f9b>] [<ffffffff81268f9b>] follow_trans_huge_pmd+0x2cb/0x340
[..]
Call Trace:
[<ffffffff81306030>] smaps_pte_range+0xa0/0x4b0
[<ffffffff814c2755>] ? vsnprintf+0x255/0x4c0
[<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0
[<ffffffff8123c8a2>] walk_page_vma+0x62/0x80
[<ffffffff81307656>] show_smap+0xa6/0x2b0
kernel BUG at fs/proc/task_mmu.c:585!
RIP: 0010:[<ffffffff81306469>] [<ffffffff81306469>] smaps_pte_range+0x499/0x4b0
Call Trace:
[<ffffffff814c2795>] ? vsnprintf+0x255/0x4c0
[<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0
[<ffffffff8123c8a2>] walk_page_vma+0x62/0x80
[<ffffffff81307696>] show_smap+0xa6/0x2b0
These locations are sanity checking page flags that must be set for an
anonymous transparent huge page, but are not set for the zone_device
pages associated with dax mappings.
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A custom allocator without __GFP_COMP that copies to userspace has been
found in vmw_execbuf_process[1], so this disables the page-span checker
by placing it behind a CONFIG for future work where such things can be
tracked down later.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1373326
Reported-by: Vinson Lee <vlee@freedesktop.org>
Fixes: f5509cc18d ("mm: Hardened usercopy")
Signed-off-by: Kees Cook <keescook@chromium.org>
KASAN allocates memory from the page allocator as part of
kmem_cache_free(), and that can reference current->mempolicy through any
number of allocation functions. It needs to be NULL'd out before the
final reference is dropped to prevent a use-after-free bug:
BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
...
Call Trace:
dump_stack
kasan_object_err
kasan_report_error
__asan_report_load2_noabort
alloc_pages_current <-- use after free
depot_save_stack
save_stack
kasan_slab_free
kmem_cache_free
__mpol_put <-- free
do_exit
This patch sets current->mempolicy to NULL before dropping the final
reference.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
Fixes: cd11016e5f ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org> [4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
of memory when booting a secondary kernel. Srikar Dronamraju reported
that multiple nodes may have no memory managed by the buddy allocator
but still return true for populated_zone().
Commit 1d82de618d ("mm, vmscan: make kswapd reclaim in terms of
nodes") was reported to cause kswapd to spin at 100% CPU usage when
fadump was enabled. The old code happened to deal with the situation of
a populated node with zero free pages by co-incidence but the current
code tries to reclaim populated zones without realising that is
impossible.
We cannot just convert populated_zone() as many existing users really
need to check for present_pages. This patch introduces a managed_zone()
helper and uses it in the few cases where it is critical that the check
is made for managed pages -- zonelist construction and page reclaim.
Link: http://lkml.kernel.org/r/20160831195104.GB8119@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A bugfix in v4.8-rc2 introduced a harmless warning when
CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:
mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
This moves the function inside of the #ifdef block that hides the
calling function, to avoid the warning.
Fixes: 1f47b61fb4 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While adding proper userfaultfd_wp support with bits in pagetable and
swap entry to avoid false positives WP userfaults through swap/fork/
KSM/etc, I've been adding a framework that mostly mirrors soft dirty.
So I noticed in one place I had to add uffd_wp support to the pagetables
that wasn't covered by soft_dirty and I think it should have.
Example: in the THP migration code migrate_misplaced_transhuge_page()
pmd_mkdirty is called unconditionally after mk_huge_pmd.
entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
That sets soft dirty too (it's a false positive for soft dirty, the soft
dirty bit could be more finegrained and transfer the bit like uffd_wp
will do.. pmd/pte_uffd_wp() enforces the invariant that when it's set
pmd/pte_write is not set).
However in the THP split there's no unconditional pmd_mkdirty after
mk_huge_pmd and pte_swp_mksoft_dirty isn't called after the migration
entry is created. The code sets the dirty bit in the struct page
instead of setting it in the pagetable (which is fully equivalent as far
as the real dirty bit is concerned, as the whole point of pagetable bits
is to be eventually flushed out of to the page, but that is not
equivalent for the soft-dirty bit that gets lost in translation).
This was found by code review only and totally untested as I'm working
to actually replace soft dirty and I don't have time to test potential
soft dirty bugfixes as well :).
Transfer the soft_dirty from pmd to pte during THP splits.
This fix avoids losing the soft_dirty bit and avoids userland memory
corruption in the checkpoint.
Fixes: eef1b3ba05 ("thp: implement split_huge_pmd()")
Link: http://lkml.kernel.org/r/1471610515-30229-2-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
check_bogus_address() checked for pointer overflow using this expression,
where 'ptr' has type 'const void *':
ptr + n < ptr
Since pointer wraparound is undefined behavior, gcc at -O2 by default
treats it like the following, which would not behave as intended:
(long)n < 0
Fortunately, this doesn't currently happen for kernel code because kernel
code is compiled with -fno-strict-overflow. But the expression should be
fixed anyway to use well-defined integer arithmetic, since it could be
treated differently by different compilers in the future or could be
reported by tools checking for undefined behavior.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
kmem_cache_node is destroyed the call_rcu() may trigger a slab
allocation to fill the debug object pool (__debug_object_init:fill_pool).
Everywhere but during kmem_cache_destroy(), discard_slab() is performed
outside of the kmem_cache_node->list_lock and avoids a lockdep warning
about potential recursion:
=============================================
[ INFO: possible recursive locking detected ]
4.8.0-rc1-gfxbench+ #1 Tainted: G U
---------------------------------------------
rmmod/8895 is trying to acquire lock:
(&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811c80d7>] get_partial_node.isra.63+0x47/0x430
but task is already holding lock:
(&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811cbda4>] __kmem_cache_shutdown+0x54/0x320
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&n->list_lock)->rlock);
lock(&(&n->list_lock)->rlock);
*** DEADLOCK ***
May be due to missing lock nesting notation
5 locks held by rmmod/8895:
#0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0
#1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0
#2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
#3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
#4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320
stack backtrace:
CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1
Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
Call Trace:
__lock_acquire+0x1646/0x1ad0
lock_acquire+0xb2/0x200
_raw_spin_lock+0x36/0x50
get_partial_node.isra.63+0x47/0x430
___slab_alloc.constprop.67+0x1a7/0x3b0
__slab_alloc.isra.64.constprop.66+0x43/0x80
kmem_cache_alloc+0x236/0x2d0
__debug_object_init+0x2de/0x400
debug_object_activate+0x109/0x1e0
__call_rcu.constprop.63+0x32/0x2f0
call_rcu+0x12/0x20
discard_slab+0x3d/0x40
__kmem_cache_shutdown+0xdb/0x320
shutdown_cache+0x19/0x60
kmem_cache_destroy+0x1ae/0x220
i915_gem_load_cleanup+0x14/0x40 [i915]
i915_driver_unload+0x151/0x180 [i915]
i915_pci_remove+0x14/0x20 [i915]
pci_device_remove+0x34/0xb0
__device_release_driver+0x95/0x140
driver_detach+0xb6/0xc0
bus_remove_driver+0x53/0xd0
driver_unregister+0x27/0x50
pci_unregister_driver+0x25/0x70
i915_exit+0x1a/0x1e2 [i915]
SyS_delete_module+0x193/0x1f0
entry_SYSCALL_64_fastpath+0x1c/0xac
Fixes: 52b4b950b5 ("mm: slab: free kmem_cache_node after destroy sysfs file")
Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
Reported-by: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Gordon <david.s.gordon@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The newly introduced shmem_huge_enabled() function has two definitions,
but neither of them is visible if CONFIG_SYSFS is disabled, leading to a
build error:
mm/khugepaged.o: In function `khugepaged':
khugepaged.c:(.text.khugepaged+0x3ca): undefined reference to `shmem_huge_enabled'
This changes the #ifdef guards around the definition to match those that
are used in the header file.
Fixes: e496cf3d78 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
Link: http://lkml.kernel.org/r/20160809123638.1357593-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg
in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
with __GFP_ACCOUNT, including those that aren't actually charged to any
cgroup, i.e. allocated from the root cgroup context. To avoid overhead
in case cgroups are not used, we only do that if memcg_kmem_enabled() is
true. The latter is set iff there are kmem-enabled memory cgroups
(online or offline). The root cgroup is not considered kmem-enabled.
As a result, if a page is allocated with __GFP_ACCOUNT for the root
cgroup when there are kmem-enabled memory cgroups and is freed after all
kmem-enabled memory cgroups were removed, e.g.
# no memory cgroups has been created yet, create one
mkdir /sys/fs/cgroup/memory/test
# run something allocating pages with __GFP_ACCOUNT, e.g.
# a program using pipe
dmesg | tail
# remove the memory cgroup
rmdir /sys/fs/cgroup/memory/test
we'll get bad page state bug complaining about page->_mapcount != -1:
BUG: Bad page state in process swapper/0 pfn:1fd945c
page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0
flags: 0x1000000000000000()
To avoid that, let's mark with PageKmemcg only those pages that are
actually charged to and hence pin a non-root memory cgroup.
Fixes: 4949148ad4 ("mm: charge/uncharge kmemcg from generic page allocator paths")
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit abf545484d changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.
Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block fixes from Jens Axboe:
"Here's the second round of block updates for this merge window.
It's a mix of fixes for changes that went in previously in this round,
and fixes in general. This pull request contains:
- Fixes for loop from Christoph
- A bdi vs gendisk lifetime fix from Dan, worth two cookies.
- A blk-mq timeout fix, when on frozen queues. From Gabriel.
- Writeback fix from Jan, ensuring that __writeback_single_inode()
does the right thing.
- Fix for bio->bi_rw usage in f2fs from me.
- Error path deadlock fix in blk-mq sysfs registration from me.
- Floppy O_ACCMODE fix from Jiri.
- Fix to the new bio op methods from Mike.
One more followup will be coming here, ensuring that we don't
propagate the block types outside of block. That, and a rename of
bio->bi_rw is coming right after -rc1 is cut.
- Various little fixes"
* 'for-linus' of git://git.kernel.dk/linux-block:
mm/block: convert rw_page users to bio op use
loop: make do_req_filebacked more robust
loop: don't try to use AIO for discards
blk-mq: fix deadlock in blk_mq_register_disk() error path
Include: blkdev: Removed duplicate 'struct request;' declaration.
Fixup direct bi_rw modifiers
block: fix bdi vs gendisk lifetime mismatch
blk-mq: Allow timeouts to run while queue is freezing
nbd: fix race in ioctl
block: fix use-after-free in seq file
f2fs: drop bio->bi_rw manual assignment
block: add missing group association in bio-cloning functions
blkcg: kill unused field nr_undestroyed_grps
writeback: Write dirty times for WB_SYNC_ALL writeback
floppy: fix open(O_ACCMODE) for ioctl-only open
Pull more powerpc updates from Michael Ellerman:
"These were delayed for various reasons, so I let them sit in next a
bit longer, rather than including them in my first pull request.
Fixes:
- Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
- Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
- Move register_process_table() out of ppc_md from Michael Ellerman
Use jump_label use for [cpu|mmu]_has_feature():
- Add mmu_early_init_devtree() from Michael Ellerman
- Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
- Do hash device tree scanning earlier from Michael Ellerman
- Do radix device tree scanning earlier from Michael Ellerman
- Do feature patching before MMU init from Michael Ellerman
- Check features don't change after patching from Michael Ellerman
- Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
- Convert mmu_has_feature() to returning bool from Michael Ellerman
- Convert cpu_has_feature() to returning bool from Michael Ellerman
- Define radix_enabled() in one place & use static inline from Michael Ellerman
- Add early_[cpu|mmu]_has_feature() from Michael Ellerman
- Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
- jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
- Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
- Remove mfvtb() from Kevin Hao
- Move cpu_has_feature() to a separate file from Kevin Hao
- Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
- Add option to use jump label for cpu_has_feature() from Kevin Hao
- Add option to use jump label for mmu_has_feature() from Kevin Hao
- Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
- Annotate jump label assembly from Michael Ellerman
TLB flush enhancements from Aneesh Kumar K.V:
- radix: Implement tlb mmu gather flush efficiently
- Add helper for finding SLBE LLP encoding
- Use hugetlb flush functions
- Drop multiple definition of mm_is_core_local
- radix: Add tlb flush of THP ptes
- radix: Rename function and drop unused arg
- radix/hugetlb: Add helper for finding page size
- hugetlb: Add flush_hugetlb_tlb_range
- remove flush_tlb_page_nohash
Add new ptrace regsets from Anshuman Khandual and Simon Guo:
- elf: Add powerpc specific core note sections
- Add the function flush_tmregs_to_thread
- Enable in transaction NT_PRFPREG ptrace requests
- Enable in transaction NT_PPC_VMX ptrace requests
- Enable in transaction NT_PPC_VSX ptrace requests
- Adapt gpr32_get, gpr32_set functions for transaction
- Enable support for NT_PPC_CGPR
- Enable support for NT_PPC_CFPR
- Enable support for NT_PPC_CVMX
- Enable support for NT_PPC_CVSX
- Enable support for TM SPR state
- Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
- Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
- Enable support for EBB registers
- Enable support for Performance Monitor registers"
* tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
powerpc/mm: Move register_process_table() out of ppc_md
powerpc/perf: Fix incorrect event codes in power9-event-list
powerpc/32: Fix early access to cpu_spec relocation
powerpc/ptrace: Enable support for Performance Monitor registers
powerpc/ptrace: Enable support for EBB registers
powerpc/ptrace: Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
powerpc/ptrace: Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
powerpc/ptrace: Enable support for TM SPR state
powerpc/ptrace: Enable support for NT_PPC_CVSX
powerpc/ptrace: Enable support for NT_PPC_CVMX
powerpc/ptrace: Enable support for NT_PPC_CFPR
powerpc/ptrace: Enable support for NT_PPC_CGPR
powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
powerpc/ptrace: Enable in transaction NT_PPC_VSX ptrace requests
powerpc/ptrace: Enable in transaction NT_PPC_VMX ptrace requests
powerpc/ptrace: Enable in transaction NT_PRFPREG ptrace requests
powerpc/process: Add the function flush_tmregs_to_thread
elf: Add powerpc specific core note sections
powerpc/mm: remove flush_tlb_page_nohash
powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
...
It causes NULL dereference error and failure to get type_a->regions[0]
info if parameter type_b of __next_mem_range_rev() == NULL
Fix this by checking before dereferring and initializing idx_b to 0
The approach is tested by dumping all types of region via
__memblock_dump_all() and __next_mem_range_rev() fixed to UART
separately the result is okay after checking the logs.
Link: http://lkml.kernel.org/r/57A0320D.6070102@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Tested-by: zijun_hu <zijun_hu@htc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Mackerras and Reza Arbab reported that machines with memoryless
nodes fail when vmstats are refreshed. Paul reported an oops as follows
Unable to handle kernel paging request for data at address 0xff7a10000
Faulting instruction address: 0xc000000000270cd0
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-kvm+ #118
task: c000000ff0680010 task.stack: c000000ff0704000
NIP: c000000000270cd0 LR: c000000000270ce8 CTR: 0000000000000000
REGS: c000000ff0707900 TRAP: 0300 Not tainted (4.7.0-kvm+)
MSR: 9000000102009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE,TM[E]> CR: 846b6824 XER: 20000000
CFAR: c000000000008768 DAR: 0000000ff7a10000 DSISR: 42000000 SOFTE: 1
NIP refresh_zone_stat_thresholds+0x80/0x240
LR refresh_zone_stat_thresholds+0x98/0x240
Call Trace:
refresh_zone_stat_thresholds+0xb8/0x240 (unreliable)
Both supplied potential fixes but one potentially misses checks and
another had redundant initialisations. This version initialises
per_cpu_nodestats on a per-pgdat basis instead of on a per-zone basis.
Link: http://lkml.kernel.org/r/20160804092404.GI2799@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Paul Mackerras <paulus@ozlabs.org>
Reported-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a8 ("block, fs, drivers: remove REQ_OP compat defs and related code")
Modified by me to:
1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK
Signed-off-by: Jens Axboe <axboe@fb.com>
The name for a bdi of a gendisk is derived from the gendisk's devt.
However, since the gendisk is destroyed before the bdi it leaves a
window where a new gendisk could dynamically reuse the same devt while a
bdi with the same name is still live. Arrange for the bdi to hold a
reference against its "owner" disk device while it is registered.
Otherwise we can hit sysfs duplicate name collisions like the following:
WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'
Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
Call Trace:
[<ffffffff8134caec>] dump_stack+0x63/0x87
[<ffffffff8108c351>] __warn+0xd1/0xf0
[<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
[<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
[<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
[<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
[<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
[<ffffffff8134ff55>] kobject_add+0x75/0xd0
[<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
[<ffffffff8148b0a5>] device_add+0x125/0x610
[<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
[<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
[<ffffffff811b775c>] bdi_register+0x8c/0x180
[<ffffffff811b7877>] bdi_register_dev+0x27/0x30
[<ffffffff813317f5>] add_disk+0x175/0x4a0
Cc: <stable@vger.kernel.org>
Reported-by: Yi Zhang <yizhan@redhat.com>
Tested-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fixed up missing 0 return in bdi_register_owner().
Signed-off-by: Jens Axboe <axboe@fb.com>
If CONFIG_TRANSPARENT_HUGE_PAGECACHE=n, HPAGE_PMD_NR evaluates to
BUILD_BUG_ON(), and may cause (e.g. with gcc 4.12):
mm/built-in.o: In function `shmem_alloc_hugepage':
shmem.c:(.text+0x17570): undefined reference to `__compiletime_assert_1365'
To fix this, move the assignment to hindex after the check for huge
pages support.
Fixes: 800d8c63b2 ("shmem: add huge pages support")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge yet more updates from Andrew Morton:
- the rest of ocfs2
- various hotfixes, mainly MM
- quite a bit of misc stuff - drivers, fork, exec, signals, etc.
- printk updates
- firmware
- checkpatch
- nilfs2
- more kexec stuff than usual
- rapidio updates
- w1 things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
ipc: delete "nr_ipc_ns"
kcov: allow more fine-grained coverage instrumentation
init/Kconfig: add clarification for out-of-tree modules
config: add android config fragments
init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
relay: add global mode support for buffer-only channels
init: allow blacklisting of module_init functions
w1:omap_hdq: fix regression
w1: add helper macro module_w1_family
w1: remove need for ida and use PLATFORM_DEVID_AUTO
rapidio/switches: add driver for IDT gen3 switches
powerpc/fsl_rio: apply changes for RIO spec rev 3
rapidio: modify for rev.3 specification changes
rapidio: change inbound window size type to u64
rapidio/idt_gen2: fix locking warning
rapidio: fix error handling in mbox request/release functions
rapidio/tsi721_dma: advance queue processing from transfer submit call
rapidio/tsi721: add messaging mbox selector parameter
rapidio/tsi721: add PCIe MRRS override parameter
rapidio/tsi721_dma: add channel mask and queue size parameters
...
There was only one use of __initdata_refok and __exit_refok
__init_refok was used 46 times against 82 for __ref.
Those definitions are obsolete since commit 312b1485fb ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")
This patch removes the following compatibility definitions and replaces
them treewide.
/* compatibility defines */
#define __init_refok __ref
#define __initdata_refok __refdata
#define __exit_refok __ref
I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>