Other than munmap, mremap might be used to shrink memory mapping too.
So, it may hold write mmap_sem for long time when shrinking large
mapping, as what commit ("mm: mmap: zap pages with read mmap_sem in
munmap") described.
The mremap() will not manipulate vmas anymore after __do_munmap() call for
the mapping shrink use case, so it is safe to downgrade to read mmap_sem.
So, the same optimization, which downgrades mmap_sem to read for zapping
pages, is also feasible and reasonable to this case.
The period of holding exclusive mmap_sem for shrinking large mapping
would be reduced significantly with this optimization.
MREMAP_FIXED and MREMAP_MAYMOVE are more complicated to adopt this
optimization since they need manipulate vmas after do_munmap(),
downgrading mmap_sem may create race window.
Simple mapping shrink is the low hanging fruit, and it may cover the
most cases of unmap with munmap together.
[akpm@linux-foundation.org: tweak comment]
[yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue]
Link: http://lkml.kernel.org/r/1538687672-17795-2-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1538067582-60038-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ZONE_DEVICE pages were being initialized in two locations. One was
with the memory_hotplug lock held and another was outside of that lock.
The problem with this is that it was nearly doubling the memory
initialization time. Instead of doing this twice, once while holding a
global lock and once without, I am opting to defer the initialization to
the one outside of the lock. This allows us to avoid serializing the
overhead for memory init and we can instead focus on per-node init times.
One issue I encountered is that devm_memremap_pages and
hmm_devmmem_pages_create were initializing only the pgmap field the same
way. One wasn't initializing hmm_data, and the other was initializing it
to a poison value. Since this is something that is exposed to the driver
in the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.
[alexander.h.duyck@linux.intel.com: fix reference count for pgmap in devm_memremap_pages]
Link: http://lkml.kernel.org/r/20181008233404.1909.37302.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/20180925202053.3576.66039.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It doesn't make much sense to use the atomic SetPageReserved at init time
when we are using memset to clear the memory and manipulating the page
flags via simple "&=" and "|=" operations in __init_single_page.
This patch adds a non-atomic version __SetPageReserved that can be used
during page init and shows about a 10% improvement in initialization times
on the systems I have available for testing. On those systems I saw
initialization times drop from around 35 seconds to around 32 seconds to
initialize a 3TB block of persistent memory. I believe the main advantage
of this is that it allows for more compiler optimization as the __set_bit
operation can be reordered whereas the atomic version cannot.
I tried adding a bit of documentation based on f1dd2cd13c ("mm,
memory_hotplug: do not associate hotadded memory to zones until online").
Ideally the reserved flag should be set earlier since there is a brief
window where the page is initialization via __init_single_page and we have
not set the PG_Reserved flag. I'm leaving that for a future patch set as
that will require a more significant refactor.
Link: http://lkml.kernel.org/r/20180925202018.3576.11607.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Address issues slowing persistent memory initialization", v5.
The main thing this patch set achieves is that it allows us to initialize
each node worth of persistent memory independently. As a result we reduce
page init time by about 2 minutes because instead of taking 30 to 40
seconds per node and going through each node one at a time, we process all
4 nodes in parallel in the case of a 12TB persistent memory setup spread
evenly over 4 nodes.
This patch (of 3):
On systems with a large amount of memory it can take a significant amount
of time to initialize all of the page structs with the PAGE_POISON_PATTERN
value. I have seen it take over 2 minutes to initialize a system with
over 12TB of RAM.
In order to work around the issue I had to disable CONFIG_DEBUG_VM and
then the boot time returned to something much more reasonable as the
arch_add_memory call completed in milliseconds versus seconds. However in
doing that I had to disable all of the other VM debugging on the system.
In order to work around a kernel that might have CONFIG_DEBUG_VM enabled
on a system that has a large amount of memory I have added a new kernel
parameter named "vm_debug" that can be set to "-" in order to disable it.
Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Delay accounting already measures the time a task spends in direct reclaim
and waiting for swapin, but in low memory situations tasks spend can spend
a significant amount of their time waiting on thrashing page cache. This
isn't tracked right now.
To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache page to
read back into memory.
Also update tools/accounting/getdelays.c:
[hannes@computer accounting]$ sudo ./getdelays -d -p 1
print delayacct stats ON
PID 1
CPU count real total virtual total delay total delay average
50318 745000000 847346785 400533713 0.008ms
IO count delay total delay average
435 122601218 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
THRASHING count delay total delay average
19 12621439 0ms
Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Refaults happen during transitions between workingsets as well as in-place
thrashing. Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.
During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bonafide thrashing.
Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime. This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.
How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.
Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
commit eb59254608 ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
the goal of accounting objects that can be reclaimed, but cannot be
allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via
kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
is converted.
The counter is however still useful for accounting direct page allocations
(i.e. not slab) with a shrinker, such as the ION page pool. So keep it,
and:
- change granularity to pages to be more like other counters; sub-page
allocations should be able to use kmalloc
- rename the counter to NR_KERNEL_MISC_RECLAIMABLE
- expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
again remove the check for not printing "hidden" counters
Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is reflected also the
MemAvailable meminfo counter and in overcommit decisions. The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but sometimes
are used also for objects that can be reclaimed, which due to varying size
cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag. A
prominent example are dcache external names, which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit f1782c9bc5 ("dcache: account external names as indirectly
reclaimable memory").
To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
gfp flags. They are added to the kmalloc_caches array as a new type.
Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
cache.
This change only applies to SLAB and SLUB, not SLOB. This is fine, since
SLOB's target are tiny system and this patch does add some overhead of
kmem management objects.
Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is how /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently mainain separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow to add kmalloc-reclaimable caches as a third type.
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I've noticed, that dying memory cgroups are often pinned in memory by a
single pagecache page. Even under moderate memory pressure they sometimes
stayed in such state for a long time. That looked strange.
My investigation showed that the problem is caused by applying the LRU
pressure balancing math:
scan = div64_u64(scan * fraction[lru], denominator),
where
denominator = fraction[anon] + fraction[file] + 1.
Because fraction[lru] is always less than denominator, if the initial scan
size is 1, the result is always 0.
This means the last page is not scanned and has
no chances to be reclaimed.
Fix this by rounding up the result of the division.
In practice this change significantly improves the speed of dying cgroups
reclaim.
[guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments]
Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle
Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If CONFIG_VMAP_STACK is set, kernel stacks are allocated using
__vmalloc_node_range() with __GFP_ACCOUNT. So kernel stack pages are
charged against corresponding memory cgroups on allocation and uncharged
on releasing them.
The problem is that we do cache kernel stacks in small per-cpu caches and
do reuse them for new tasks, which can belong to different memory cgroups.
Each stack page still holds a reference to the original cgroup, so the
cgroup can't be released until the vmap area is released.
To make this happen we need more than two subsequent exits without forks
in between on the current cpu, which makes it very unlikely to happen. As
a result, I saw a significant number of dying cgroups (in theory, up to 2
* number_of_cpu + number_of_tasks), which can't be released even by
significant memory pressure.
As a cgroup structure can take a significant amount of memory (first of
all, per-cpu data like memcg statistics), it leads to a noticeable waste
of memory.
Link: http://lkml.kernel.org/r/20180827162621.30187-1-guro@fb.com
Fixes: ac496bf48d ("fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tracing the event "fs_dax:dax_pmd_insert_mapping" with perf produces this
warning:
[fs_dax:dax_pmd_insert_mapping] unknown op '~'
It is printed in process_op (tools/lib/traceevent/event-parse.c) because
'~' is parsed as a binary operator.
perf reads the format of fs_dax:dax_pmd_insert_mapping ("print fmt") from
/sys/kernel/debug/tracing/events/fs_dax/dax_pmd_insert_mapping/format .
The format contains:
~(((u64) ~(~(((1UL) << 12)-1)))
^
\ interpreted as a binary operator by process_op().
This part is generated in the declaration of the event class
dax_pmd_insert_mapping_class in include/trace/events/fs_dax.h :
__print_flags_u64(__entry->pfn_val & PFN_FLAGS_MASK, "|",
PFN_FLAGS_TRACE),
This patch adds a pair of parentheses in the declaration of PFN_FLAGS_MASK
to make sure that '~' is parsed as a unary operator by perf.
The part of the format that was problematic is now:
~(((u64) (~(~(((1UL) << 12)-1))))
Now, all the '~' are parsed as unary operators.
Link: http://lkml.kernel.org/r/20181021145939.8760-1-sebhtml@videotron.qc.ca
Signed-off-by: Sebastien Boisvert <sebhtml@videotron.qc.ca>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Elenie Godzaridis <arangradient@gmail.com>
Cc: <stable@vger.kerenl.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull MIPS fixes from Paul Burton:
"A couple of MIPS fixes that should have ideally made it for v4.19, but
hey-ho here they are now:
- A fix for potential poor stack placement introduced in v4.19-rc8.
- A fix for a warning introduced in use of TURBOchannel devices by
DMA changes in v4.16"
* tag 'mips_fixes_4.20_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: VDSO: Reduce VDSO_RANDOMIZE_SIZE to 64MB for 64bit
TC: Set DMA masks for devices
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- A large series to rewrite our SLB miss handling, replacing a lot of
fairly complicated asm with much fewer lines of C.
- Following on from that, we now maintain a cache of SLB entries for
each process and preload them on context switch. Leading to a 27%
speedup for our context switch benchmark on Power9.
- Improvements to our handling of SLB multi-hit errors. We now print
more debug information when they occur, and try to continue running
by flushing the SLB and reloading, rather than treating them as
fatal.
- Enable THP migration on 64-bit Book3S machines (eg. Power7/8/9).
- Add support for physical memory up to 2PB in the linear mapping on
64-bit Book3S. We only support up to 512TB as regular system
memory, otherwise the percpu allocator runs out of vmalloc space.
- Add stack protector support for 32 and 64-bit, with a per-task
canary.
- Add support for PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP.
- Support recognising "big cores" on Power9, where two SMT4 cores are
presented to us as a single SMT8 core.
- A large series to cleanup some of our ioremap handling and PTE
flags.
- Add a driver for the PAPR SCM (storage class memory) interface,
allowing guests to operate on SCM devices (acked by Dan).
- Changes to our ftrace code to handle very large kernels, where we
need to use a trampoline to get to ftrace_caller().
And many other smaller enhancements and cleanups.
Thanks to: Alan Modra, Alistair Popple, Aneesh Kumar K.V, Anton
Blanchard, Aravinda Prasad, Bartlomiej Zolnierkiewicz, Benjamin
Herrenschmidt, Breno Leitao, Cédric Le Goater, Christophe Leroy,
Christophe Lombard, Dan Carpenter, Daniel Axtens, Finn Thain, Gautham
R. Shenoy, Gustavo Romero, Haren Myneni, Hari Bathini, Jia Hongtao,
Joel Stanley, John Allen, Laurent Dufour, Madhavan Srinivasan, Mahesh
Salgaonkar, Mark Hairgrove, Masahiro Yamada, Michael Bringmann,
Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Nathan
Fontenot, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Oliver
O'Halloran, Paul Mackerras, Petr Vorel, Rashmica Gupta, Reza Arbab,
Rob Herring, Sam Bobroff, Samuel Mendoza-Jonas, Scott Wood, Stan
Johnson, Stephen Rothwell, Stewart Smith, Suraj Jitindar Singh, Tyrel
Datwyler, Vaibhav Jain, Vasant Hegde, YueHaibing, zhong jiang"
* tag 'powerpc-4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (221 commits)
Revert "selftests/powerpc: Fix out-of-tree build errors"
powerpc/msi: Fix compile error on mpc83xx
powerpc: Fix stack protector crashes on CPU hotplug
powerpc/traps: restore recoverability of machine_check interrupts
powerpc/64/module: REL32 relocation range check
powerpc/64s/radix: Fix radix__flush_tlb_collapsed_pmd double flushing pmd
selftests/powerpc: Add a test of wild bctr
powerpc/mm: Fix page table dump to work on Radix
powerpc/mm/radix: Display if mappings are exec or not
powerpc/mm/radix: Simplify split mapping logic
powerpc/mm/radix: Remove the retry in the split mapping logic
powerpc/mm/radix: Fix small page at boundary when splitting
powerpc/mm/radix: Fix overuse of small pages in splitting logic
powerpc/mm/radix: Fix off-by-one in split mapping logic
powerpc/ftrace: Handle large kernel configs
powerpc/mm: Fix WARN_ON with THP NUMA migration
selftests/powerpc: Fix out-of-tree build errors
powerpc/time: no steal_time when CONFIG_PPC_SPLPAR is not selected
powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64
powerpc/time: isolate scaled cputime accounting in dedicated functions.
...
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix the NFSv4.1 r/wsize sanity checking
- Reset the RPC/RDMA credit grant properly after a disconnect
- Fix a missed page unlock after pg_doio()
Features and optimisations:
- Overhaul of the RPC client socket code to eliminate a locking
bottleneck and reduce the latency when transmitting lots of
requests in parallel.
- Allow parallelisation of the RPCSEC_GSS encoding of an RPC request.
- Convert the RPC client socket receive code to use iovec_iter() for
improved efficiency.
- Convert several NFS and RPC lookup operations to use RCU instead of
taking global locks.
- Avoid the need for BH-safe locks in the RPC/RDMA back channel.
Bugfixes and cleanups:
- Fix lock recovery during NFSv4 delegation recalls
- Fix the NFSv4 + NFSv4.1 "lookup revalidate + open file" case.
- Fixes for the RPC connection metrics
- Various RPC client layer cleanups to consolidate stream based
sockets
- RPC/RDMA connection cleanups
- Simplify the RPC/RDMA cleanup after memory operation failures
- Clean ups for NFS v4.2 copy completion and NFSv4 open state
reclaim"
* tag 'nfs-for-4.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (97 commits)
SUNRPC: Convert the auth cred cache to use refcount_t
SUNRPC: Convert auth creds to use refcount_t
SUNRPC: Simplify lookup code
SUNRPC: Clean up the AUTH cache code
NFS: change sign of nfs_fh length
sunrpc: safely reallow resvport min/max inversion
nfs: remove redundant call to nfs_context_set_write_error()
nfs: Fix a missed page unlock after pg_doio()
SUNRPC: Fix a compile warning for cmpxchg64()
NFSv4.x: fix lock recovery during delegation recall
SUNRPC: use cmpxchg64() in gss_seq_send64_fetch_and_inc()
xprtrdma: Squelch a sparse warning
xprtrdma: Clean up xprt_rdma_disconnect_inject
xprtrdma: Add documenting comments
xprtrdma: Report when there were zero posted Receives
xprtrdma: Move rb_flags initialization
xprtrdma: Don't disable BH's in backchannel server
xprtrdma: Remove memory address of "ep" from an error message
xprtrdma: Rename rpcrdma_qp_async_error_upcall
xprtrdma: Simplify RPC wake-ups on connect
...
Pull device mapper updates from Mike Snitzer:
- The biggest change this cycle is to remove support for the legacy IO
path (.request_fn) from request-based DM.
Jens has already started preparing for complete removal of the legacy
IO path in 4.21 but this earlier removal of support from DM has been
coordinated with Jens (as evidenced by the commit being attributed to
him).
Making request-based DM exclussively blk-mq only cleans up that
portion of DM core quite nicely.
- Convert the thinp and zoned targets over to using refcount_t where
applicable.
- A couple fixes to the DM zoned target for refcounting and other races
buried in the implementation of metadata block creation and use.
- Small cleanups to remove redundant unlikely() around a couple
WARN_ON_ONCE().
- Simplify how dm-ioctl copies from userspace, eliminating some
potential for a malicious user trying to change the executed ioctl
after its processing has begun.
- Tweaked DM crypt target to use the DM device name when naming the
various workqueues created for a particular DM crypt device (makes
the N workqueues for a DM crypt device more easily understood and
enhances user's accounting capabilities at a glance via "ps")
- Small fixup to remove dead branch in DM writecache's memory_entry().
* tag 'for-4.20/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm writecache: remove disabled code in memory_entry()
dm zoned: fix various dmz_get_mblock() issues
dm zoned: fix metadata block ref counting
dm raid: avoid bitmap with raid4/5/6 journal device
dm crypt: make workqueue names device-specific
dm: add dm_table_device_name()
dm ioctl: harden copy_params()'s copy_from_user() from malicious users
dm: remove unnecessary unlikely() around WARN_ON_ONCE()
dm zoned: target: use refcount_t for dm zoned reference counters
dm thin: use refcount_t for thin_c reference counting
dm table: require that request-based DM be layered on blk-mq devices
dm: rename DM_TYPE_MQ_REQUEST_BASED to DM_TYPE_REQUEST_BASED
dm: remove legacy request-based IO path
Pull more block layer updates from Jens Axboe:
- Set of patches improving support for zoned devices. This was ready
before the merge window, but I was late in picking it up and hence it
missed the original pull request (Damien, Christoph)
- libata no link power management quirk addition for a Samsung drive
(Diego Viola)
- Fix for a performance regression in BFQ that went into this merge
window (Federico Motta)
- Fix for a missing dma mask setting return value check (Gustavo)
- Typo in the gdrom queue failure case (me)
- NULL pointer deref fix for xen-blkfront (Vasilis Liaskovitis)
- Fixing the get_rq trace point placement in blk-mq (Xiaoguang Wang)
- Removal of a set-but-not-read variable in cdrom (zhong jiang)
* tag 'for-linus-20181026' of git://git.kernel.dk/linux-block:
libata: Apply NOLPM quirk for SAMSUNG MZ7TD256HAFV-000L9
block, bfq: fix asymmetric scenarios detection
gdrom: fix mistake in assignment of error
blk-mq: place trace_block_getrq() in correct place
block: Introduce blk_revalidate_disk_zones()
block: add a report_zones method
block: Expose queue nr_zones in sysfs
block: Improve zone reset execution
block: Introduce BLKGETNRZONES ioctl
block: Introduce BLKGETZONESZ ioctl
block: Limit allocation of zone descriptors for report zones
block: Introduce blkdev_nr_zones() helper
scsi: sd_zbc: Fix sd_zbc_check_zones() error checks
scsi: sd_zbc: Reduce boot device scan and revalidate time
scsi: sd_zbc: Rearrange code
cdrom: remove set but not used variable 'tocuse'
skd: fix unchecked return values
xen/blkfront: avoid NULL blkfront_info dereference on device removal
Pull Devicetree updates from Rob Herring:
"A bit bigger than normal as I've been busy this cycle.
There's a few things with dependencies and a few things subsystem
maintainers didn't pick up, so I'm taking them thru my tree.
The fixes from Johan didn't get into linux-next, but they've been
waiting for some time now and they are what's left of what subsystem
maintainers didn't pick up.
Summary:
- Sync dtc with upstream version v1.4.7-14-gc86da84d30e4
- Work to get rid of direct accesses to struct device_node name and
type pointers in preparation for removing them. New helpers for
parsing DT cpu nodes and conversions to use the helpers. printk
conversions to %pOFn for printing DT node names. Most went thru
subystem trees, so this is the remainder.
- Fixes to DT child node lookups to actually be restricted to child
nodes instead of treewide.
- Refactoring of dtb targets out of arch code. This makes the support
more uniform and enables building all dtbs on c6x, microblaze, and
powerpc.
- Various DT binding updates for Renesas r8a7744 SoC
- Vendor prefixes for Facebook, OLPC
- Restructuring of some ARM binding docs moving some peripheral
bindings out of board/SoC binding files
- New "secure-chosen" binding for secure world settings on ARM
- Dual licensing of 2 DT IRQ binding headers"
* tag 'devicetree-for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (78 commits)
ARM: dt: relicense two DT binding IRQ headers
power: supply: twl4030-charger: fix OF sibling-node lookup
NFC: nfcmrvl_uart: fix OF child-node lookup
net: stmmac: dwmac-sun8i: fix OF child-node lookup
net: bcmgenet: fix OF child-node lookup
drm/msm: fix OF child-node lookup
drm/mediatek: fix OF sibling-node lookup
of: Add missing exports of node name compare functions
dt-bindings: Add OLPC vendor prefix
dt-bindings: misc: bk4: Add device tree binding for Liebherr's BK4 SPI bus
dt-bindings: thermal: samsung: Add SPDX license identifier
dt-bindings: clock: samsung: Add SPDX license identifiers
dt-bindings: timer: ostm: Add R7S9210 support
dt-bindings: phy: rcar-gen2: Add r8a7744 support
dt-bindings: can: rcar_can: Add r8a7744 support
dt-bindings: timer: renesas, cmt: Document r8a7744 CMT support
dt-bindings: watchdog: renesas-wdt: Document r8a7744 support
dt-bindings: thermal: rcar: Add device tree support for r8a7744
Documentation: dt: Add binding for /secure-chosen/stdout-path
dt-bindings: arm: zte: Move sysctrl bindings to their own doc
...
Pull more dma-mapping updates from Christoph Hellwig:
- various swiotlb cleanups
- do not dip into the ѕwiotlb pool for dma coherent allocations
- add support for not cache coherent DMA to swiotlb
- switch ARM64 to use the generic swiotlb_dma_ops
* tag 'dma-mapping-4.20-1' of git://git.infradead.org/users/hch/dma-mapping:
arm64: use the generic swiotlb_dma_ops
swiotlb: add support for non-coherent DMA
swiotlb: don't dip into swiotlb pool for coherent allocations
swiotlb: refactor swiotlb_map_page
swiotlb: use swiotlb_map_page in swiotlb_map_sg_attrs
swiotlb: merge swiotlb_unmap_page and unmap_single
swiotlb: remove the overflow buffer
swiotlb: do not panic on mapping failures
swiotlb: mark is_swiotlb_buffer static
swiotlb: remove a pointless comment
Pull IOMMU updates from Joerg Roedel:
- Debugfs support for the Intel VT-d driver.
When enabled, it now also exposes some of its internal data
structures to user-space for debugging purposes.
- ARM-SMMU driver now uses the generic deferred flushing and fast-path
iova allocation code.
This is expected to be a major performance improvement, as this
allocation path scales a lot better.
- Support for r8a7744 in the Renesas iommu driver
- Couple of minor fixes and improvements all over the place
* tag 'iommu-updates-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (39 commits)
iommu/arm-smmu-v3: Remove unnecessary wrapper function
iommu/arm-smmu-v3: Add SPDX header
iommu/amd: Add default branch in amd_iommu_capable()
dt-bindings: iommu: ipmmu-vmsa: Add r8a7744 support
iommu/amd: Move iommu_init_pci() to .init section
iommu/arm-smmu: Support non-strict mode
iommu/io-pgtable-arm-v7s: Add support for non-strict mode
iommu/arm-smmu-v3: Add support for non-strict mode
iommu/io-pgtable-arm: Add support for non-strict mode
iommu: Add "iommu.strict" command line option
iommu/dma: Add support for non-strict mode
iommu/arm-smmu: Ensure that page-table updates are visible before TLBI
iommu/arm-smmu-v3: Implement flush_iotlb_all hook
iommu/arm-smmu-v3: Avoid back-to-back CMD_SYNC operations
iommu/arm-smmu-v3: Fix unexpected CMD_SYNC timeout
iommu/io-pgtable-arm: Fix race handling in split_blk_unmap()
iommu/arm-smmu-v3: Fix a couple of minor comment typos
iommu: Fix a typo
iommu: Remove .domain_{get,set}_windows
iommu: Tidy up window attributes
...
Pull char/misc driver updates from Greg KH:
"Here is the big set of char/misc patches for 4.20-rc1.
Loads of things here, we have new code in all of these driver
subsystems:
- fpga
- stm
- extcon
- nvmem
- eeprom
- hyper-v
- gsmi
- coresight
- thunderbolt
- vmw_balloon
- goldfish
- soundwire
along with lots of fixes and minor changes to other small drivers.
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (245 commits)
Documentation/security-bugs: Clarify treatment of embargoed information
lib: Fix ia64 bootloader linkage
MAINTAINERS: Clarify UIO vs UIOVEC maintainer
docs/uio: fix a grammar nitpick
docs: fpga: document programming fpgas using regions
fpga: add devm_fpga_region_create
fpga: bridge: add devm_fpga_bridge_create
fpga: mgr: add devm_fpga_mgr_create
hv_balloon: Replace spin_is_locked() with lockdep
sgi-xp: Replace spin_is_locked() with lockdep
eeprom: New ee1004 driver for DDR4 memory
eeprom: at25: remove unneeded 'at25_remove'
w1: IAD Register is yet readable trough iad sys file. Fix snprintf (%u for unsigned, count for max size).
misc: mic: scif: remove set but not used variables 'src_dma_addr, dst_dma_addr'
misc: mic: fix a DMA pool free failure
platform: goldfish: pipe: Add a blank line to separate varibles and code
platform: goldfish: pipe: Remove redundant casting
platform: goldfish: pipe: Call misc_deregister if init fails
platform: goldfish: pipe: Move the file-scope goldfish_pipe_dev variable into the driver state
platform: goldfish: pipe: Move the file-scope goldfish_pipe_miscdev variable into the driver state
...
Pull driver core updates from Greg KH:
"Here is a small number of driver core patches for 4.20-rc1.
Not much happened here this merge window, only a very tiny number of
patches that do:
- add BUS_ATTR_WO() for use by drivers
- component error path fixes
- kernfs range check fix
- other tiny error path fixes and const changes
All of these have been in linux-next with no reported issues for a
while"
* tag 'driver-core-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
devres: provide devm_kstrdup_const()
mm: move is_kernel_rodata() to asm-generic/sections.h
devres: constify p in devm_kfree()
driver core: add BUS_ATTR_WO() macro
kernfs: Fix range checks in kernfs_get_target_path
component: fix loop condition to call unbind() if bind() fails
drivers/base/devtmpfs.c: don't pretend path is const in delete_path
kernfs: update comment about kernfs_path() return value
Pull USB/PHY updates from Greg KH:
"Here is the big USB/PHY driver patches for 4.20-rc1
Lots of USB changes in here, primarily in these areas:
- typec updates and new drivers
- new PHY drivers
- dwc2 driver updates and additions (this old core keeps getting
added to new devices.)
- usbtmc major update based on the industry group coming together and
working to add new features and performance to the driver.
- USB gadget additions for new features
- USB gadget configfs updates
- chipidea driver updates
- other USB gadget updates
- USB serial driver updates
- renesas driver updates
- xhci driver updates
- other tiny USB driver updates
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (229 commits)
usb: phy: ab8500: silence some uninitialized variable warnings
usb: xhci: tegra: Add genpd support
usb: xhci: tegra: Power-off power-domains on removal
usbip:vudc: BUG kmalloc-2048 (Not tainted): Poison overwritten
usbip: tools: fix atoi() on non-null terminated string
USB: misc: appledisplay: fix backlight update_status return code
phy: phy-pxa-usb: add a new driver
usb: host: add DT bindings for faraday fotg2
usb: host: ohci-at91: fix request of irq for optional gpio
usb/early: remove set but not used variable 'remain_length'
usb: typec: Fix copy/paste on typec_set_vconn_role() kerneldoc
usb: typec: tcpm: Report back negotiated PPS voltage and current
USB: core: remove set but not used variable 'udev'
usb: core: fix memory leak on port_dev_path allocation
USB: net2280: Remove ->disconnect() callback from net2280_pullup()
usb: dwc2: disable power_down on rockchip devices
usb: gadget: udc: renesas_usb3: add support for r8a77990
dt-bindings: usb: renesas_usb3: add bindings for r8a77990
usb: gadget: udc: renesas_usb3: Add r8a774a1 support
USB: serial: cypress_m8: remove set but not used variable 'iflag'
...