If the system has more than one swap device and the swap devices have
node information, we can make use of this information in
get_swap_pages() to decide which swap device to use, for better
performance.
The current code uses a priority-based list, swap_avail_list, to decide
which swap device to use; if multiple swap devices share the same
priority, they are used round-robin. This patch changes the previous
single global swap_avail_list into a per-numa-node list, i.e. each
numa node sees its own priority-based list of available swap devices.
A swap device's priority can be promoted on its matching node's
swap_avail_list.
A swap device's priority is currently set as follows: the user can set
a value >= 0, or the system picks one starting from -1 and counting
downwards. The priority value stored in the swap_avail_list is the
negation of the swap device's priority, because a plist is sorted from
low to high. The new policy doesn't change the semantics for the
priority >= 0 cases; automatic assignment, previously starting from -1
and counting downwards, now starts from -2, with -1 reserved as the
promoted value.
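To illustrate the assignment and negation rules, a small userspace C
model follows; the function names are made up for illustration and this
is not the kernel code:

#include <stdio.h>

/* Illustrative model of the priority rules described above: user-set
 * priorities are >= 0; otherwise the system assigns -2, -3, ...
 * (-1 is reserved as the per-node promoted value), and the plist key
 * is the negated priority so the list sorts low-to-high, i.e.
 * highest priority first. */
static int next_auto_prio = -2;		/* first automatic priority */

static int assign_prio(int user_prio)
{
	if (user_prio >= 0)
		return user_prio;
	return next_auto_prio--;	/* -2, then -3, -4, ... */
}

static int plist_key(int prio, int on_matching_node)
{
	if (prio < 0 && on_matching_node)
		prio = -1;		/* promoted on its own node */
	return -prio;			/* negate: plist sorts low-to-high */
}

int main(void)
{
	int a = assign_prio(-1);	/* swapA: gets -2 */
	int b = assign_prio(-1);	/* swapB: gets -3 */

	printf("swapA key on-node=%d off-node=%d\n", plist_key(a, 1), plist_key(a, 0));
	printf("swapB key on-node=%d off-node=%d\n", plist_key(b, 1), plist_key(b, 0));
	return 0;
}

This reproduces the example below: each device sorts first (key 1) on
its own node's list and at its negated priority elsewhere.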
Take a 4-node EX machine as an example: suppose 4 swap devices are
available, each sitting on a different node:
swapA on node 0
swapB on node 1
swapC on node 2
swapD on node 3
After they are all swapped on, in the sequence ABCD:
Current behaviour:
their priorities will be:
swapA: -1
swapB: -2
swapC: -3
swapD: -4
And their positions in the global swap_avail_list will be:
swapA  -> swapB  -> swapC  -> swapD
prio:1    prio:2    prio:3    prio:4
New behaviour:
their priorities will be (note that -1 is skipped):
swapA: -2
swapB: -3
swapC: -4
swapD: -5
And their positions in the 4 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
   swapA  -> swapB  -> swapC  -> swapD
   prio:1    prio:3    prio:4    prio:5
swap_avail_lists[1]: /* node 1's available swap device list */
   swapB  -> swapA  -> swapC  -> swapD
   prio:1    prio:2    prio:4    prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
   swapC  -> swapA  -> swapB  -> swapD
   prio:1    prio:2    prio:3    prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
   swapD  -> swapA  -> swapB  -> swapC
   prio:1    prio:2    prio:3    prio:4
To see the effect of the patch, a test is used that starts N processes,
each of which mmaps a region of anonymous memory and then continually
writes to it at random positions, triggering both swap-in and swap-out.
On a 2-node Skylake EP machine with 64GiB memory, two 170GB SSD drives
are used as swap devices, each attached to a different node. The
results are:
runtime=30m/processes=32/total test size=128G/each process mmap region=4G
kernel        throughput
vanilla       13306
auto-binding  15169  (+14%)

runtime=30m/processes=64/total test size=128G/each process mmap region=2G
kernel        throughput
vanilla       11885
auto-binding  14879  (+25%)
[aaron.lu@intel.com: v2]
Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
[akpm@linux-foundation.org: use kmalloc_array()]
Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
VMA based swap readahead reads ahead the virtual pages that are
contiguous in the virtual address space, while the original swap
readahead reads ahead the swap slots that are contiguous in the swap
device. Although VMA based swap readahead chooses the swap slots to
read ahead more correctly, it triggers more small random reads, which
may degrade the performance of an HDD (hard disk) heavily, and the cost
may ultimately outweigh the benefit.
To avoid this issue, this patch disables VMA based swap readahead when
an HDD is used as a swap device, and uses the original swap readahead
instead.
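A minimal userspace sketch of the kind of gate this describes, assuming
a counter of rotational swap devices maintained at swapon/swapoff time;
all names here are illustrative, not the kernel's:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative gate only: VMA based readahead is used only when no
 * rotational (HDD) swap device is active; otherwise fall back to the
 * original slot-based readahead. */
static bool vma_readahead_enabled = true;	/* sysfs-style knob */
static int nr_rotational_swap_devices;		/* bumped at swapon() */

static bool swap_use_vma_readahead(void)
{
	return vma_readahead_enabled && nr_rotational_swap_devices == 0;
}

int main(void)
{
	printf("SSD only: use VMA readahead = %d\n", swap_use_vma_readahead());
	nr_rotational_swap_devices = 1;		/* an HDD was swapped on */
	printf("with HDD: use VMA readahead = %d\n", swap_use_vma_readahead());
	return 0;
}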
Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Swap readahead is an important mechanism for reducing swap-in latency.
Although a purely sequential memory access pattern isn't very common
for anonymous memory, spatial locality is still considered valid.
In the original swap readahead implementation, consecutive blocks in
the swap device are read ahead based on a global estimate of spatial
locality. But consecutive blocks in the swap device merely reflect the
order of page reclaim; they don't necessarily reflect the access
pattern in virtual memory. Moreover, different tasks in the system may
have different access patterns, which makes the global spatial-locality
estimate inaccurate.
In this patch, when a page fault occurs, the virtual pages near the
fault address are read ahead instead of the swap slots near the
faulting swap slot in the swap device. This avoids reading ahead
unrelated swap slots. At the same time, swap readahead is changed to
work per-VMA instead of globally, so that the different access patterns
of different VMAs can be distinguished and different readahead policies
applied accordingly. The original core readahead detection and scaling
algorithm is reused, because it is an effective algorithm for detecting
spatial locality.
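To make the mechanism concrete, here is a hedged userspace model of
reading ahead around the faulting virtual address, clamped to the VMA;
the window sizing and all names are illustrative only:

#include <stdio.h>

#define PAGE_SHIFT 12

/* Illustrative model: read ahead the virtual pages surrounding the
 * faulting address, clamped to the VMA, instead of the swap slots
 * adjacent to the faulting slot on the swap device. */
static void vma_readahead(unsigned long fault_addr,
			  unsigned long vma_start, unsigned long vma_end,
			  unsigned int win)	/* pages, from hit-rate scaling */
{
	unsigned long vpfn = fault_addr >> PAGE_SHIFT;
	unsigned long start = vpfn - win / 2, end = vpfn + win / 2;

	if (start < (vma_start >> PAGE_SHIFT))
		start = vma_start >> PAGE_SHIFT;
	if (end > (vma_end >> PAGE_SHIFT))
		end = vma_end >> PAGE_SHIFT;

	for (unsigned long p = start; p < end; p++)
		printf("readahead virtual page at 0x%lx\n", p << PAGE_SHIFT);
}

int main(void)
{
	vma_readahead(0x7f0000005000UL, 0x7f0000000000UL, 0x7f0000010000UL, 4);
	return 0;
}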
The tests and results are as follows.
Common test condition
=====================
Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk
Micro-benchmark with combined access pattern
============================================
vm-scalability, sequential swap test case: 4 processes consume 50G of
virtual memory space, repeating sequential memory writes for 300
seconds. The first round of writing triggers swap-out; the following
rounds trigger sequential swap-in and swap-out.
At the same time, the vm-scalability random swap test case runs in the
background: 8 processes consume 30G of virtual memory space, repeating
random memory writes for 300 seconds. This triggers random swap-in in
the background.
This is a combined workload with sequential and random memory accesses
at the same time. The results (for the sequential workload) are as
follows:
                 Base          Optimized
                 ----          ---------
throughput       345413 KB/s   414029 KB/s  (+19.9%)
latency.average  97.14 us      61.06 us     (-37.1%)
latency.50th     2 us          1 us
latency.60th     2 us          1 us
latency.70th     98 us         2 us
latency.80th     160 us        2 us
latency.90th     260 us        217 us
latency.95th     346 us        369 us
latency.99th     1.34 ms       1.09 ms
ra_hit%          52.69%        99.98%
The original swap readahead algorithm is confused by the background
random access workload, so its readahead hit rate is lower. The
VMA-based readahead algorithm works much better.
Linpack
=======
The test memory size is bigger than RAM to trigger swapping.
              Base       Optimized
              ----       ---------
elapsed_time  393.49 s   329.88 s  (-16.2%)
ra_hit%       86.21%     98.82%
The scores of the base and optimized kernels show no visible change,
but the elapsed time is reduced and the readahead hit rate improved, so
the optimized kernel performs better during the startup and tear-down
stages. The absolute readahead hit rate is also high, showing that
spatial locality is still valid in some practical workloads.
Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm, swap: VMA based swap readahead", v4.
Swap readahead is an important mechanism for reducing swap-in latency.
Although a purely sequential memory access pattern isn't very common
for anonymous memory, spatial locality is still considered valid.
In the original swap readahead implementation, consecutive blocks in
the swap device are read ahead based on a global estimate of spatial
locality. But consecutive blocks in the swap device merely reflect the
order of page reclaim; they don't necessarily reflect the access
pattern in virtual memory space. Moreover, different tasks in the
system may have different access patterns, which makes the global
spatial-locality estimate inaccurate.
In this patchset, when a page fault occurs, the virtual pages near the
fault address are read ahead instead of the swap slots near the
faulting swap slot in the swap device. This avoids reading ahead
unrelated swap slots. At the same time, swap readahead is changed to
work per-VMA instead of globally, so that the different access patterns
of different VMAs can be distinguished and different readahead policies
applied accordingly. The original core readahead detection and scaling
algorithm is reused, because it is an effective algorithm for detecting
spatial locality.
In addition to the swap readahead changes, some new sysfs interfaces
are added to show the efficiency of the readahead algorithm and some
other swap statistics.
This new implementation will incur more small random reads. On SSD,
the improved accuracy of the estimation and readahead target should
outweigh the potential increased overhead; this is also illustrated in
the test results below. But on HDD, the overhead may outweigh the
benefit, so the original implementation will be used by default.
The tests and results are as follows.
Common test condition
=====================
Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk
Micro-benchmark with combined access pattern
============================================
vm-scalability, sequential swap test case: 4 processes consume 50G of
virtual memory space, repeating sequential memory writes for 300
seconds. The first round of writing triggers swap-out; the following
rounds trigger sequential swap-in and swap-out.
At the same time, the vm-scalability random swap test case runs in the
background: 8 processes consume 30G of virtual memory space, repeating
random memory writes for 300 seconds. This triggers random swap-in in
the background.
This is a combined workload with sequential and random memory accesses
at the same time. The results (for the sequential workload) are as
follows:
                 Base          Optimized
                 ----          ---------
throughput       345413 KB/s   414029 KB/s  (+19.9%)
latency.average  97.14 us      61.06 us     (-37.1%)
latency.50th     2 us          1 us
latency.60th     2 us          1 us
latency.70th     98 us         2 us
latency.80th     160 us        2 us
latency.90th     260 us        217 us
latency.95th     346 us        369 us
latency.99th     1.34 ms       1.09 ms
ra_hit%          52.69%        99.98%
The original swap readahead algorithm is confused by the background
random access workload, so its readahead hit rate is lower. The
VMA-based readahead algorithm works much better.
Linpack
=======
The test memory size is bigger than RAM to trigger swapping.
              Base       Optimized
              ----       ---------
elapsed_time  393.49 s   329.88 s  (-16.2%)
ra_hit%       86.21%     98.82%
The scores of the base and optimized kernels show no visible change,
but the elapsed time is reduced and the readahead hit rate improved, so
the optimized kernel performs better during the startup and tear-down
stages. The absolute readahead hit rate is also high, showing that
spatial locality is still valid in some practical workloads.
This patch (of 5):
The statistics for total readahead pages and total readahead hits are
recorded and exported via the following sysfs interfaces:
/sys/kernel/mm/swap/ra_hits
/sys/kernel/mm/swap/ra_total
With them, the efficiency of swap readahead can be measured, so that
the swap readahead algorithm and its parameters can be tuned
accordingly.
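A hedged userspace analogue of how such counters can be maintained; the
kernel uses its own atomic and sysfs helpers, so this only sketches the
accounting idea:

#include <stdatomic.h>
#include <stdio.h>

/* Illustrative analogue of the ra_total / ra_hits accounting: bump
 * ra_total for every page submitted by readahead, bump ra_hits when a
 * later fault finds the page already present in the swap cache. */
static atomic_ulong ra_total, ra_hits;

static void account_readahead(unsigned long pages)
{
	atomic_fetch_add(&ra_total, pages);
}

static void account_readahead_hit(void)
{
	atomic_fetch_add(&ra_hits, 1);
}

int main(void)
{
	account_readahead(8);
	account_readahead_hit();
	printf("ra_total=%lu ra_hits=%lu\n",
	       atomic_load(&ra_total), atomic_load(&ra_hits));
	return 0;
}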
[akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Normal swap slot reclaim can be done when a swap count reaches
SWAP_HAS_CACHE. But for a swap slot which is backing a THP, all swap
slots backing that THP must be reclaimed together, because the swap
slots may be used again when the THP is swapped out again later. So
the swap slots backing one THP can be reclaimed together only when the
swap counts of all of them have reached SWAP_HAS_CACHE. This patch
implements the functions that check whether the swap counts of all
slots backing one THP have reached SWAP_HAS_CACHE, and uses them when
checking whether a swap slot can be reclaimed.
To make it easier to determine whether a swap slot is backing a THP, a
new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
cluster which is backing a THP (Transparent Huge Page). Because
swapping a THP in as a whole isn't supported yet, the
CLUSTER_FLAG_HUGE flag is cleared after the THP is deleted from the
swap cache (for example, when swap-out finishes), so that the normal
pages inside the THP can be swapped in individually.
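Conceptually, the check walks every swap slot backing the THP and
permits reclaim only when all of them are down to SWAP_HAS_CACHE. A
hedged userspace model (the flag value and page count are illustrative):

#include <stdbool.h>
#include <stdio.h>

#define SWAP_HAS_CACHE	0x40	/* illustrative flag value */
#define HPAGE_NR_PAGES	512	/* 2MB THP / 4kB pages, x86-64 */

/* Illustrative model: the swap slots backing one THP may only be
 * reclaimed together, once every slot's count has dropped to
 * SWAP_HAS_CACHE, i.e. only the swap cache reference remains. */
static bool thp_cluster_reclaimable(const unsigned char *map,
				    unsigned long offset)
{
	for (int i = 0; i < HPAGE_NR_PAGES; i++)
		if (map[offset + i] != SWAP_HAS_CACHE)
			return false;
	return true;
}

int main(void)
{
	unsigned char map[HPAGE_NR_PAGES];

	for (int i = 0; i < HPAGE_NR_PAGES; i++)
		map[i] = SWAP_HAS_CACHE;
	printf("all at SWAP_HAS_CACHE, reclaimable: %d\n",
	       thp_cluster_reclaimable(map, 0));
	map[3] |= 1;	/* one slot still has a swap count */
	printf("one slot still referenced, reclaimable: %d\n",
	       thp_cluster_reclaimable(map, 0));
	return 0;
}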
[ying.huang@intel.com: fix swap_page_trans_huge_swapped on HDD]
Link: http://lkml.kernel.org/r/874ltsm0bi.fsf@yhuang-dev.intel.com
Link: http://lkml.kernel.org/r/20170724051840.2309-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement a variant of find_get_pages() that stops iterating at a given
index. This may be a substantial performance gain if the mapping is
sparse; see the following commit for details. Furthermore, lots of
users of this function (through pagevec_lookup()) actually want a range
lookup, and all of them are currently open-coding it.
Also create a corresponding pagevec_lookup_range() function.
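To illustrate why stopping at the end index matters, here is a small
userspace model of a ranged batch lookup over a sparse index; the batch
size and all names are illustrative, not the kernel API:

#include <stdio.h>

#define BATCH 14	/* illustrative batch size */

/* Illustrative model: stop as soon as the end index is passed instead
 * of scanning on through the (possibly huge) sparse tail past the
 * range. */
static unsigned lookup_range(const int *present, unsigned long nr_slots,
			     unsigned long *start, unsigned long end,
			     unsigned long *batch)
{
	unsigned n = 0;

	for (unsigned long i = *start; i < nr_slots && i <= end && n < BATCH; i++)
		if (present[i])
			batch[n++] = i;
	*start = (n ? batch[n - 1] : end) + 1;	/* caller's resume point */
	return n;
}

int main(void)
{
	int present[100] = { [2] = 1, [5] = 1, [97] = 1 };
	unsigned long start = 0, batch[BATCH];
	unsigned n = lookup_range(present, 100, &start, 9, batch);

	printf("found %u pages in [0,9], next start %lu\n", n, start);
	return 0;
}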
Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Ranged pagevec lookup", v2.
In this series I make pagevec_lookup() update the index (to be
consistent with pagevec_lookup_tag() and also as a preparation for
ranged lookups), provide a ranged variant of pagevec_lookup(), and use
it in places where it makes sense. This not only removes some common
code but is also a measurable performance win for some use cases (see
patch 4/10) where the radix tree is sparse and searching & grabbing
pages after the end of the range has measurable overhead.
This patch (of 10):
The callback doesn't ever get called. Remove it.
Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zonelists_mutex was introduced by commit 4eaf3f6439 ("mem-hotplug: fix
potential race while building zonelist for new populated zone") to
protect zonelist building from races. This is no longer needed, though,
because both memory online and offline are fully serialized. New users
have grown since then, however.
Notably, setup_per_zone_wmarks wants to prevent races between memory
hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
(see cfd3da1e49 ("mm: Serialize access to min_free_kbytes")). Let's
add a private lock for that purpose. This will not prevent us from
observing a halfway-through memory hotplug operation, but that
shouldn't be a big deal because memory hotplug will update watermarks
explicitly, so we will eventually get a full picture. The lock just
makes sure we won't race when updating watermarks, which could
otherwise lead to weird results.
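A hedged sketch of the shape of this change, as a userspace analogue
with an illustrative lock name:

#include <pthread.h>
#include <stdio.h>

/* Illustrative analogue: a private lock just for the watermark update
 * path instead of the global zonelists_mutex. */
static pthread_mutex_t wmark_lock = PTHREAD_MUTEX_INITIALIZER;
static long min_free_kbytes = 67584;

static void setup_per_zone_wmarks(long new_min)
{
	/* serialize memory hotplug, khugepaged setup and sysctl
	 * updates so concurrent recalculations cannot interleave */
	pthread_mutex_lock(&wmark_lock);
	min_free_kbytes = new_min;
	/* ... recompute each zone's min/low/high watermarks ... */
	pthread_mutex_unlock(&wmark_lock);
}

int main(void)
{
	setup_per_zone_wmarks(90112);
	printf("min_free_kbytes=%ld\n", min_free_kbytes);
	return 0;
}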
Also, __build_all_zonelists manipulates global data, so add a private
lock for it as well. This doesn't seem to be necessary today, but it
is more robust to have a lock there.
While we are at it, make sure we document that memory online/offline
depends on full serialization, either via mem_hotplug_begin() or
device_lock.
Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Haicheng Li <haicheng.li@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "cleanup zonelists initialization", v1.
This series is aimed at cleaning up the zonelists initialization code
we have, but the primary motivation was bug report [2], which got
resolved; still, the usage of stop_machine is just too ugly to live.
Most patches are straightforward, but 3 of them need special
consideration.
Patch 1 removes zone-ordered zonelists completely. I am CCing linux-api
because this is a user-visible change. As I argue in the patch
description, I do not think we have a strong usecase for it these days.
I have kept the sysctl in place and warn in the log if somebody tries
to configure zone-ordered lists. If somebody has a real usecase for it
we can revert this patch, but I do not expect anybody will actually
notice runtime differences. This patch is not strictly needed for the
rest, but it made patch 6 easier to implement.
Patch 7 removes stop_machine from build_all_zonelists without adding
any special synchronization between iterators and the updater, which I
_believe_ is acceptable as explained in the changelog. I hope I am not
missing anything.
Patch 8 then removes zonelists_mutex, which is kind of ugly as well and
not really needed AFAICS, but care should be taken when double-checking
my thinking.
This patch (of 9):
Supporting zone-ordered zonelists costs us a lot of code while the
usefulness is arguable, if existent at all. Mel has already made node
ordering the default on 64b systems. 32b systems still use
ZONELIST_ORDER_ZONE because it is considered better to fall back to a
different NUMA node rather than consume precious lowmem zones.
This argument is, however, weakened by the fact that memory reclaim has
been reworked to be node rather than zone oriented. This means that
lowmem requests have to skip over all highmem pages on the LRUs
already, so zone ordering doesn't save much reclaim time. So the only
advantage of zone ordering is under light memory pressure, when highmem
requests never hit lowmem zones and the lowmem pressure doesn't need to
reclaim.
Considering that 32b NUMA systems are rather suboptimal already, and it
is generally advisable to use a 64b kernel on such HW, I believe we
should rather care about code maintainability and just get rid of
ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
somebody tries to set zone ordering, either from the kernel command
line or the sysctl.
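A hedged userspace model of keeping the sysctl while warning on zone
ordering; the kernel's proc handler plumbing differs, so this only
sketches the idea:

#include <stdio.h>
#include <strings.h>

/* Illustrative model only: the numa_zonelist_order sysctl is kept for
 * compatibility, but selecting zone ordering merely logs a warning;
 * node ordering is always used. */
static int set_zonelist_order(const char *val)
{
	if (strncasecmp(val, "z", 1) == 0)
		fprintf(stderr, "warning: zone ordering no longer supported, using node ordering\n");
	return 0;	/* value accepted for compatibility, but ignored */
}

int main(void)
{
	set_zonelist_order("zone");
	set_zonelist_order("node");
	return 0;
}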
[mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prior to commit f1dd2cd13c ("mm, memory_hotplug: do not associate
hotadded memory to zones until online") we used to allow changing the
valid zone types of a memory block if it was adjacent to a different
zone type.
This fact was reflected in memoryNN/valid_zones by the ordering of the
printed zones. The first one was the default (echo online >
memoryNN/state) and the other one could be onlined explicitly by
online_{movable,kernel}.
This behavior was removed by the said patch, and as such the ordering
was no longer all that important. In most cases a kernel zone would be
the default anyway. The only exception is movable_node, handled by
"mm, memory_hotplug: support movable_node for hotpluggable nodes".
Let's reintroduce this behavior, because a later patch will remove the
zone overlap restriction, so the user will be allowed to online a block
as kernel resp. movable regardless of its placement. The original
behavior will then become significant again, because it would otherwise
be non-trivial for users to see what the default zone to online into
is.
The implementation is really simple. Pull zone selection out of
move_pfn_range into a zone_for_pfn_range helper and use it in
show_valid_zones to display the zone for default onlining, and then
both kernel and movable if they are allowed. The default online zone
is not duplicated.
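A hedged, much-simplified sketch of the helper's role; the real zone
selection consults more state than shown here, and the enum names are
illustrative:

#include <stdbool.h>
#include <stdio.h>

/* Much-simplified sketch: an explicit online request wins, otherwise
 * fall back to the default policy (e.g. movable_node machines prefer
 * the movable zone). */
enum zone_type { ZONE_NORMAL, ZONE_MOVABLE };
enum online_type { ONLINE_KEEP, ONLINE_KERNEL, ONLINE_MOVABLE };

static enum zone_type zone_for_pfn_range(enum online_type how,
					 bool movable_node_default)
{
	if (how == ONLINE_KERNEL)
		return ZONE_NORMAL;
	if (how == ONLINE_MOVABLE)
		return ZONE_MOVABLE;
	/* default onlining: the policy decides */
	return movable_node_default ? ZONE_MOVABLE : ZONE_NORMAL;
}

int main(void)
{
	printf("default zone: %s\n",
	       zone_for_pfn_range(ONLINE_KEEP, false) == ZONE_NORMAL ?
	       "Normal" : "Movable");
	return 0;
}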
Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Kani Toshimitsu <toshi.kani@hpe.com>
Cc: <slaoub@gmail.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This SLUB free list pointer obfuscation code is modified from Brad
Spengler/PaX Team's code in the last public patch of grsecurity/PaX
based on my understanding of the code. Changes or omissions from the
original code are mine and don't reflect the original grsecurity/PaX
code.
This adds a per-cache random value to SLUB caches that is XORed with
their freelist pointer address and value. This adds nearly zero
overhead and frustrates the very common heap overflow exploitation
method of overwriting freelist pointers.
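The hardening boils down to XORing the stored pointer with a per-cache
secret and with the address it is stored at. A hedged userspace model
of that idea (the structure and names are illustrative, not the SLUB
code):

#include <stdint.h>
#include <stdio.h>

/* Illustrative model: the next-free pointer stored in a free object is
 * XORed with a per-cache random secret and with the address it is
 * stored at, so overwriting it without knowing the secret yields a
 * useless value on decode. */
struct cache {
	uintptr_t random;	/* per-cache secret, set at cache creation */
};

static void *encode_free_ptr(const struct cache *s, void *next, void *slot)
{
	return (void *)((uintptr_t)next ^ s->random ^ (uintptr_t)slot);
}

static void *decode_free_ptr(const struct cache *s, void *stored, void *slot)
{
	return (void *)((uintptr_t)stored ^ s->random ^ (uintptr_t)slot);
}

int main(void)
{
	struct cache s = { .random = 0xcafef00dUL };
	void *next = (void *)0x1000, *slot = (void *)0x2000;
	void *stored = encode_free_ptr(&s, next, slot);

	printf("stored=%p decoded=%p\n", stored, decode_free_ptr(&s, stored, slot));
	return 0;
}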
A recent example of the attack is written up here:
http://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit
and there is a section dedicated to the technique in the book "A Guide
to Kernel Exploitation: Attacking the Core".
This is based on patches by Daniel Micay, and refactored to minimize the
use of #ifdef.
With 200-count cycles of "hackbench -g 20 -l 1000" I saw the following
run times:
before:
mean 10.11882499999999999995
variance .03320378329145728642
stdev .18221905304181911048
after:
mean 10.12654000000000000014
variance .04700556623115577889
stdev .21680767106160192064
The difference gets lost in the noise, but if the above is to be taken
literally, using CONFIG_FREELIST_HARDENED is 0.07% slower.
Link: http://lkml.kernel.org/r/20170802180609.GA66807@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Daniel Micay <danielmicay@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tycho Andersen <tycho@docker.com>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When servicing mmap() reads from file holes the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree.
This has three major drawbacks:
1) It consumes memory unnecessarily. For every 4k page that is read via
a DAX mmap() over a hole, we allocate a new page cache page. This
means that if you read 1GiB worth of pages, you end up using 1GiB of
zeroed memory. This is easily visible by looking at the overall
memory consumption of the system or by looking at /proc/[pid]/smaps:
7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data
Size: 1048576 kB
Rss: 1048576 kB
Pss: 1048576 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 1048576 kB
Private_Dirty: 0 kB
Referenced: 1048576 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
2) It is slower than using a common zero page because each page fault
has more work to do. Instead of just inserting a common zero page we
have to allocate a page cache page, zero it, and then insert it. Here
are the average latencies of dax_load_hole() as measured by ftrace on
a random test box:
Old method, using zeroed page cache pages: 3.4 us
New method, using the common 4k zero page: 0.8 us
This was the average latency over 1 GiB of sequential reads done by
this simple fio script:
[global]
size=1G
filename=/root/dax/data
fallocate=none
[io]
rw=read
ioengine=mmap
3) The fact that we had to check for both DAX exceptional entries and
for page cache pages in the radix tree made the DAX code more
complex.
Solve these issues by following the lead of the DAX PMD code and using a
common 4k zero page instead. As with the PMD code we will now insert a
DAX exceptional entry into the radix tree instead of a struct page
pointer which allows us to remove all the special casing in the DAX
code.
Note that we still pretty aggressively check for regular pages in the
DAX radix tree, especially where we take action based on the bits set
in the page. If we ever find a regular page in our radix tree now, it
most likely means that someone besides DAX is inserting pages (which
has happened lots of times in the past), and we want to find that out
early and fail loudly.
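The "fail loudly" checking amounts to asserting that every entry found
in the DAX radix tree is an exceptional entry. A hedged userspace model
(the tag bit value is illustrative):

#include <assert.h>
#include <stdio.h>

#define ENTRY_EXCEPTIONAL 0x1UL	/* illustrative low tag bit */

/* Illustrative model: with the common zero page in use, the DAX radix
 * tree should only ever hold exceptional (tagged) entries.  A plain
 * struct page pointer turning up means someone else inserted pages,
 * and we want to catch that early and loudly. */
static unsigned long dax_check_entry(unsigned long entry)
{
	if (!(entry & ENTRY_EXCEPTIONAL)) {
		fprintf(stderr, "BUG: regular page in DAX radix tree\n");
		assert(0);
	}
	return entry;
}

int main(void)
{
	printf("ok: %#lx\n", dax_check_entry(0x4000UL | ENTRY_EXCEPTIONAL));
	return 0;
}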
This solution also removes the extra memory consumption. Here is that
same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
code:
7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data
Size: 1048576 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
Overall system memory consumption is similarly improved.
Another major change is that we remove dax_pfn_mkwrite() from our fault
flow, and instead rely on the page fault itself to make the PTE dirty
and writeable. The following description from the patch adding the
vm_insert_mixed_mkwrite() call explains this a little more:
"To be able to use the common 4k zero page in DAX we need to have our
PTE fault path look more like our PMD fault path where a PTE entry
can be marked as dirty and writeable as it is first inserted rather
than waiting for a follow-up dax_pfn_mkwrite() =>
finish_mkwrite_fault() call.
Right now we can rely on having a dax_pfn_mkwrite() call because we
can distinguish between these two cases in do_wp_page():
case 1: 4k zero page => writable DAX storage
case 2: read-only DAX storage => writeable DAX storage
This distinction is made via vm_normal_page(). vm_normal_page()
returns false for the common 4k zero page, though, just as it does
for DAX ptes. Instead of special casing the DAX + 4k zero page case
we will simplify our DAX PTE page fault sequence so that it matches
our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
We will instead use dax_iomap_fault() to handle write-protection
faults.
This means that insert_pfn() needs to follow the lead of
insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
'mkwrite' is set insert_pfn() will do the work that was previously
done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When servicing mmap() reads from file holes the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree. This has three major
drawbacks:
1) It consumes memory unnecessarily. For every 4k page that is read via
a DAX mmap() over a hole, we allocate a new page cache page. This
means that if you read 1GiB worth of pages, you end up using 1GiB of
zeroed memory.
2) It is slower than using a common zero page because each page fault
has more work to do. Instead of just inserting a common zero page we
have to allocate a page cache page, zero it, and then insert it.
3) The fact that we had to check for both DAX exceptional entries and
for page cache pages in the radix tree made the DAX code more
complex.
This series solves these issues by following the lead of the DAX PMD
code and using a common 4k zero page instead. This reduces memory usage
and decreases latencies for some workloads, and it simplifies the DAX
code, removing over 100 lines in total.
This patch (of 5):
To be able to use the common 4k zero page in DAX we need to have our PTE
fault path look more like our PMD fault path where a PTE entry can be
marked as dirty and writeable as it is first inserted rather than
waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault()
call.
Right now we can rely on having a dax_pfn_mkwrite() call because we can
distinguish between these two cases in do_wp_page():
case 1: 4k zero page => writable DAX storage
case 2: read-only DAX storage => writeable DAX storage
This distinction is made via vm_normal_page(). vm_normal_page()
returns false for the common 4k zero page, though, just as it does for
DAX ptes. Instead of special casing the DAX + 4k zero page case we will
simplify our DAX PTE page fault sequence so that it matches our DAX PMD
sequence, and get rid of the dax_pfn_mkwrite() helper. We will instead
use dax_iomap_fault() to handle write-protection faults.
This means that insert_pfn() needs to follow the lead of
insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If 'mkwrite'
is set insert_pfn() will do the work that was previously done by
wp_page_reuse() as part of the dax_pfn_mkwrite() call path.
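A hedged toy model of the mkwrite idea, using a made-up PTE structure
rather than the kernel's page table code: with the flag set, the entry
is installed dirty and writable up front.

#include <stdbool.h>
#include <stdio.h>

/* Toy PTE model: with mkwrite set, the entry is installed already
 * dirty and writable, so no follow-up dax_pfn_mkwrite()-style fault is
 * needed to upgrade it. */
struct toy_pte {
	unsigned long pfn;
	bool write, dirty;
};

static void insert_pfn(struct toy_pte *pte, unsigned long pfn, bool mkwrite)
{
	pte->pfn = pfn;
	pte->write = mkwrite;	/* previously upgraded later... */
	pte->dirty = mkwrite;	/* ...by the wp_page_reuse() path */
}

int main(void)
{
	struct toy_pte pte;

	insert_pfn(&pte, 0x1234, true);
	printf("pfn=%#lx write=%d dirty=%d\n", pte.pfn, pte.write, pte.dirty);
	return 0;
}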
Link: http://lkml.kernel.org/r/20170724170616.25810-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) Support ipv6 checksum offload in sunvnet driver, from Shannon
Nelson.
2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
Dumazet.
3) Allow generic XDP to work on virtual devices, from John Fastabend.
4) Add bpf device maps and XDP_REDIRECT, which can be used to build
arbitrary switching frameworks using XDP. From John Fastabend.
5) Remove UFO offloads from the tree; they gave us little other than bugs.
6) Remove the IPSEC flow cache, from Florian Westphal.
7) Support ipv6 route offload in mlxsw driver.
8) Support VF representors in bnxt_en, from Sathya Perla.
9) Add support for forward error correction modes to ethtool, from
Vidya Sagar Ravipati.
10) Add time filter for packet scheduler action dumping, from Jamal Hadi
Salim.
11) Extend the zerocopy sendmsg() used by virtio and tap to regular
sockets via MSG_ZEROCOPY. From Willem de Bruijn.
12) Significantly rework value tracking in the BPF verifier, from Edward
Cree.
13) Add new jump instructions to eBPF, from Daniel Borkmann.
14) Rework rtnetlink plumbing so that operations can be run without
taking the RTNL semaphore. From Florian Westphal.
15) Support XDP in tap driver, from Jason Wang.
16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.
17) Add Huawei hinic ethernet driver.
18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
Delalande.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
i40e: avoid NVM acquire deadlock during NVM update
drivers: net: xgene: Remove return statement from void function
drivers: net: xgene: Configure tx/rx delay for ACPI
drivers: net: xgene: Read tx/rx delay for ACPI
rocker: fix kcalloc parameter order
rds: Fix non-atomic operation on shared flag variable
net: sched: don't use GFP_KERNEL under spin lock
vhost_net: correctly check tx avail during rx busy polling
net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
rxrpc: Make service connection lookup always check for retry
net: stmmac: Delete dead code for MDIO registration
gianfar: Fix Tx flow control deactivation
cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
cxgb4: Fix pause frame count in t4_get_port_stats
cxgb4: fix memory leak
tun: rename generic_xdp to skb_xdp
tun: reserve extra headroom only when XDP is set
net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
net: dsa: bcm_sf2: Advertise number of egress queues
...
Pull writeback error handling updates from Jeff Layton:
"This pile continues the work from last cycle on better tracking
writeback errors. In v4.13 we added some basic errseq_t infrastructure
and converted a few filesystems to use it.
This set continues refining that infrastructure, adds documentation,
and converts most of the other filesystems to use it. The main
exception at this point is the NFS client"
* tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
ecryptfs: convert to file_write_and_wait in ->fsync
mm: remove optimizations based on i_size in mapping writeback waits
fs: convert a pile of fsync routines to errseq_t based reporting
gfs2: convert to errseq_t based writeback error reporting for fsync
fs: convert sync_file_range to use errseq_t based error-tracking
mm: add file_fdatawait_range and file_write_and_wait
fuse: convert to errseq_t based error tracking for fsync
mm: consolidate dax / non-dax checks for writeback
Documentation: add some docs for errseq_t
errseq: rename __errseq_set to errseq_set
Pull file locking updates from Jeff Layton:
"This pile just has a few file locking fixes from Ben Coddington. There
are a couple of cleanup patches + an attempt to bring sanity to the
l_pid value that is reported back to userland on an F_GETLK request.
After a few gyrations, he came up with a way for filesystems to
communicate to the VFS layer code whether the pid should be translated
according to the namespace or presented as-is to userland"
* tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: restore a warn for leaked locks on close
fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks
fs/locks: Use allocation rather than the stack in fcntl_getlk()
Pull ext4 updates from Ted Ts'o:
"Scalability improvements when allocating inodes, and some
miscellaneous bug fixes and cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: avoid Y2038 overflow in recently_deleted()
ext4: fix fault handling when mounted with -o dax,ro
ext4: fix quota inconsistency during orphan cleanup for read-only mounts
ext4: fix incorrect quotaoff if the quota feature is enabled
ext4: remove useless test and assignment in strtohash functions
ext4: backward compatibility support for Lustre ea_inode implementation
ext4: remove timebomb in ext4_decode_extra_time()
ext4: use sizeof(*ptr)
ext4: in ext4_seek_{hole,data}, return -ENXIO for negative offsets
ext4: reduce lock contention in __ext4_new_inode
ext4: cleanup goto next group
ext4: do not unnecessarily allocate buffer in recently_deleted()
Pull XFS updates from Darrick Wong:
"Here are the changes for xfs for 4.14. Most of these are cleanups and
fixes for bad behavior, as we're mostly focusing on improving
reliability this cycle (read: there's potentially a lot of stuff on the
horizon for 4.15 so better to spend a few weeks killing other bugs
now).
Summary:
- Write unmount record for a ro mount to avoid unnecessary log replay
- Clean up orphaned inodes when mounting fs readonly
- Resubmit inode log items when buffer writeback fails to avoid
umount hang
- Fix log recovery corruption problems when log headers wrap around
the end
- Avoid infinite loop searching for free inodes when inode counters
are wrong
- Evict inodes involved with log redo so that we don't leak them
later
- Fix a potential race between reclaim and inode cluster freeing
- Refactor the inode joining code w.r.t. transaction rolling &
deferred ops
- Fix a bug where the log doesn't properly deal with dirty buffers
that are about to become ordered buffers
- Fix the extent swap code to deal with making dirty buffers ordered
properly
- Consolidate page fault handlers
- Refactor the incore extent manipulation functions to use the iext
abstractions instead of directly modifying with extent data
- Disable crashy chattr +/-x until we fix it
- Don't allow us to set S_DAX for v2 inodes
- Various cleanups
- Clarify some documentation
- Fix a problem where fsync and a log commit race to send the disk a
flush command, resulting in a small window where power fail data
loss could occur
- Simplify some rmap operations in the fcollapse code
- Fix some use-after-free problems in async writeback"
* tag 'xfs-4.14-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits)
xfs: use kmem_free to free return value of kmem_zalloc
xfs: open code end_buffer_async_write in xfs_finish_page_writeback
xfs: don't set v3 xflags for v2 inodes
xfs: fix compiler warnings
fsmap: fix documentation of FMR_OF_LAST
xfs: simplify the rmap code in xfs_bmse_merge
xfs: remove unused flags arg from xfs_file_iomap_begin_delay
xfs: fix incorrect log_flushed on fsync
xfs: disable per-inode DAX flag
xfs: replace xfs_qm_get_rtblks with a direct call to xfs_bmap_count_leaves
xfs: rewrite xfs_bmap_count_leaves using xfs_iext_get_extent
xfs: use xfs_iext_*_extent helpers in xfs_bmap_split_extent_at
xfs: use xfs_iext_*_extent helpers in xfs_bmap_shift_extents
xfs: move some code around inside xfs_bmap_shift_extents
xfs: use xfs_iext_get_extent in xfs_bmap_first_unused
xfs: switch xfs_bmap_local_to_extents to use xfs_iext_insert
xfs: add a xfs_iext_update_extent helper
xfs: consolidate the various page fault handlers
iomap: return VM_FAULT_* codes from iomap_page_mkwrite
xfs: relog dirty buffers during swapext bmbt owner change
...
Improve accuracy of statfs reporting for Ceph filesystems comprising
exactly one data pool. In this case, the Ceph monitor can now report
the space usage for the single data pool instead of the global data
for the entire Ceph cluster. Include support for this message in
mon_client and leverage it in ceph/super.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
These flags explicitly tell the MDS whether there are pending capsnaps.
Without this explicit notification, the MDS can only infer whether a
client has pending capsnaps, and the method the MDS uses for that is
inefficient and error-prone.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
mdio_mux_init() uses the parameter dev for two distinct things:
1) Have a device for all devm_ functions
2) Get the device_node from it
Since these are two distinct purposes, this patch adds a parameter,
mdio_mux, that is linked to task 2.
This also permits registering an of_node mdio-mux that lacks a direct
owning device, for example an mdio-mux which is a subnode of a real
device.
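A hedged sketch of the new calling convention only; the remaining
parameters and the exact prototype are elided and should be taken from
the patch itself:

/* Sketch: dev stays the owner for devm_* allocations (task 1), while
 * mdio_mux supplies the device_node to parse (task 2); previously both
 * roles fell on dev. */
struct device;
struct device_node;

int mdio_mux_init(struct device *dev,		/* task 1: devm_ owner */
		  struct device_node *mdio_mux,	/* task 2: mux of_node */
		  ...);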
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull device properties framework updates from Rafael Wysocki:
"These introduce fwnode operations for all of the separate types of
'firmware nodes' that can be handled by the device properties
framework, make the framework use const fwnode arguments all over, add
a helper for the consolidated handling of node references and switch
over the framework to the new UUID API.
Specifics:
- Introduce fwnode operations for all of the separate types of
'firmware nodes' that can be handled by the device properties
framework and drop the type field from struct fwnode_handle (Sakari
Ailus, Arnd Bergmann).
- Make the device properties framework use const fwnode arguments
where possible (Sakari Ailus).
- Add a helper for the consolidated handling of node references to
the device properties framework (Sakari Ailus).
- Switch over the ACPI part of the device properties framework to the
new UUID API (Andy Shevchenko)"
* tag 'devprop-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: device property: Switch to use new generic UUID API
device property: export irqchip_fwnode_ops
device property: Introduce fwnode_property_get_reference_args
device property: Constify fwnode property API
device property: Constify argument to pset fwnode backend
ACPI: Constify internal fwnode arguments
ACPI: Constify acpi_bus helper functions, switch to macros
ACPI: Prepare for constifying acpi_get_next_subnode() fwnode argument
device property: Get rid of struct fwnode_handle type field
ACPI: Use IS_ERR_OR_NULL() instead of non-NULL check in is_acpi_data_node()
Pull ACPI updates from Rafael Wysocki:
"These include a usual ACPICA code update (this time to upstream
revision 20170728), a fix for a boot crash on some systems with
Thunderbolt devices connected at boot time, a rework of the handling
of PCI bridges when setting up device wakeup, new support for Apple
device properties, support for DMA configurations reported via ACPI on
ARM64, APEI-related updates, ACPI EC driver updates and assorted minor
modifications in several places.
Specifics:
- Update the ACPICA code in the kernel to upstream revision 20170728
including:
* Alias operator handling update (Bob Moore).
* Deferred resolution of reference package elements (Bob Moore).
* Support for the _DMA method in walk resources (Bob Moore).
* Tables handling update and support for deferred table
verification (Lv Zheng).
* Update of SMMU models for IORT (Robin Murphy).
* Compiler and disassembler updates (Alex James, Erik Schmauss,
Ganapatrao Kulkarni, James Morse).
* Tools updates (Erik Schmauss, Lv Zheng).
* Assorted minor fixes and cleanups (Bob Moore, Kees Cook, Lv
Zheng, Shao Ming).
- Rework the initialization of non-wakeup GPEs with method handlers
in order to address a boot crash on some systems with Thunderbolt
devices connected at boot time where we miss an early hotplug event
due to a delay in GPE enabling (Rafael Wysocki).
- Rework the handling of PCI bridges when setting up ACPI-based
device wakeup in order to avoid disabling wakeup for bridges
prematurely (Rafael Wysocki).
- Consolidate Apple DMI checks throughout the tree, add support for
Apple device properties to the device properties framework and use
these properties for the handling of I2C and SPI devices on Apple
systems (Lukas Wunner).
- Add support for _DMA to the ACPI-based device properties lookup
code and make it possible to use the information from there to
configure DMA regions on ARM64 systems (Lorenzo Pieralisi).
- Fix several issues in the APEI code, add support for exporting the
BERT error region over sysfs and update APEI MAINTAINERS entry with
reviewers information (Borislav Petkov, Dongjiu Geng, Loc Ho, Punit
Agrawal, Tony Luck, Yazen Ghannam).
- Fix a potential initialization ordering issue in the ACPI EC driver
and clean it up somewhat (Lv Zheng).
- Update the ACPI SPCR driver to extend the existing XGENE 8250
workaround in it to a new platform (m400) and to work around an
Xgene UART clock issue (Graeme Gregory).
- Add a new utility function to the ACPI core to support using ACPI
OEM ID / OEM Table ID / Revision for system identification in
blacklisting or similar and switch over the existing code already
using this information to this new interface (Toshi Kani).
- Fix an xpower PMIC issue related to GPADC reads that always return
0 without extra pin manipulations (Hans de Goede).
- Add statements to print debug messages in a couple of places in the
ACPI core for easier diagnostics (Rafael Wysocki).
- Clean up the ACPI processor driver slightly (Colin Ian King, Hanjun
Guo).
- Clean up the ACPI x86 boot code somewhat (Andy Shevchenko).
- Add a quirk for Dell OptiPlex 9020M to the ACPI backlight driver
(Alex Hung).
- Assorted fixes, cleanups and updates related to ACPI (Amitoj Kaur
Chawla, Bhumika Goyal, Frank Rowand, Jean Delvare, Punit Agrawal,
Ronald Tschalär, Sumeet Pawnikar)"
* tag 'acpi-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (75 commits)
ACPI / APEI: Suppress message if HEST not present
intel_pstate: convert to use acpi_match_platform_list()
ACPI / blacklist: add acpi_match_platform_list()
ACPI, APEI, EINJ: Subtract any matching Register Region from Trigger resources
ACPI: make device_attribute const
ACPI / sysfs: Extend ACPI sysfs to provide access to boot error region
ACPI: APEI: fix the wrong iteration of generic error status block
ACPI / processor: make function acpi_processor_check_duplicates() static
ACPI / EC: Clean up EC GPE mask flag
ACPI: EC: Fix possible issues related to EC initialization order
ACPI / PM: Add debug statements to acpi_pm_notify_handler()
ACPI: Add debug statements to acpi_global_event_handler()
ACPI / scan: Enable GPEs before scanning the namespace
ACPICA: Make it possible to enable runtime GPEs earlier
ACPICA: Dispatch active GPEs at init time
ACPI: SPCR: work around clock issue on xgene UART
ACPI: SPCR: extend XGENE 8250 workaround to m400
ACPI / LPSS: Don't abort ACPI scan on missing mem resource
mailbox: pcc: Drop uninformative output during boot
ACPI/IORT: Add IORT named component memory address limits
...
Pull power management updates from Rafael Wysocki:
"This time (again) cpufreq gets the majority of changes which mostly
are driver updates (including a major consolidation of intel_pstate),
some schedutil governor modifications and core cleanups.
There also are some changes in the system suspend area, mostly related
to diagnostics and debug messages plus some renames of things related
to suspend-to-idle. One major change here is that suspend-to-idle is
now going to be preferred over S3 on systems where the ACPI tables
indicate to do so and provide requisite support (the Low Power Idle S0
_DSM in particular). The system sleep documentation and the tools
related to it are updated too.
The rest is a few cpuidle changes (nothing major), devfreq updates,
generic power domains (genpd) framework updates and a few assorted
modifications elsewhere.
Specifics:
- Drop the P-state selection algorithm based on a PID controller from
intel_pstate and make it use the same P-state selection method
(based on the CPU load) for all types of systems in the active mode
(Rafael Wysocki, Srinivas Pandruvada).
- Rework the cpufreq core and governors to make it possible to take
cross-CPU utilization updates into account and modify the schedutil
governor to actually do so (Viresh Kumar).
- Clean up the handling of transition latency information in the
cpufreq core and untangle it from the information on which drivers
cannot do dynamic frequency switching (Viresh Kumar).
- Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek
cpufreq driver and update its DT bindings (Sean Wang).
- Modify the cpufreq dt-platdev driver to automatically create
cpufreq devices for the new (v2) Operating Performance Points (OPP)
DT bindings and update its whitelist of supported systems (Viresh
Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley
Xiao).
- Add support for Ux500 to the cpufreq-dt driver and drop the
obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).
- Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
Nguyen).
- Fix and clean up assorted issues in the cpufreq drivers and core
(Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).
- Update the IO-wait boost handling in the schedutil governor to make
it less aggressive (Joel Fernandes).
- Rework system suspend diagnostics to make it print fewer messages
to the kernel log by default, add a sysfs knob to allow more
suspend-related messages to be printed and add Low Power S0 Idle
constraints checks to the ACPI suspend-to-idle code (Rafael
Wysocki, Srinivas Pandruvada).
- Prefer suspend-to-idle over S3 on ACPI-based systems with the
ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
interface present in the ACPI tables (Rafael Wysocki).
- Update documentation related to system sleep and rename a number of
items in the code to make it clearer that they are related to
suspend-to-idle (Rafael Wysocki).
- Export a variable allowing device drivers to check the target
system sleep state from the core system suspend code (Florian
Fainelli).
- Clean up the cpuidle subsystem to handle the polling state on x86
in a more straightforward way and to use %pOF instead of full_name
(Rafael Wysocki, Rob Herring).
- Update the devfreq framework to fix and clean up a few minor issues
(Chanwoo Choi, Rob Herring).
- Extend diagnostics in the generic power domains (genpd) framework
and clean it up slightly (Thara Gopinath, Rob Herring).
- Fix and clean up a couple of issues in the operating performance
points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).
- Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
(AVS) driver (David Wu).
- Fix the usage of notifiers in CPU power management on some
platforms (Alex Shi).
- Update the pm-graph system suspend/hibernation and boot profiling
utility (Todd Brandt).
- Make it possible to run the cpupower utility without CPU0 (Prarit
Bhargava)"
* tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits)
cpuidle: Make drivers initialize polling state
cpuidle: Move polling state initialization code to separate file
cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol
cpufreq: imx6q: Fix imx6sx low frequency support
cpufreq: speedstep-lib: make several arrays static, makes code smaller
PM: docs: Delete the obsolete states.txt document
PM: docs: Describe high-level PM strategies and sleep states
PM / devfreq: Fix memory leak when fail to register device
PM / devfreq: Add dependency on PM_OPP
PM / devfreq: Move private devfreq_update_stats() into devfreq
PM / devfreq: Convert to using %pOF instead of full_name
PM / AVS: rockchip-io: add io selectors and supplies for RV1108
cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
cpufreq: dt-platdev: Drop few entries from whitelist
cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
ARM: ux500: don't select CPUFREQ_DT
cpuidle: Convert to using %pOF instead of full_name
cpufreq: Convert to using %pOF instead of full_name
PM / Domains: Convert to using %pOF instead of full_name
cpufreq: Cap the default transition delay value to 10 ms
...
NFS-over-RDMA client updates for Linux 4.14
Bugfixes and cleanups:
- Constify rpc_xprt_ops
- Harden RPC call encoding and decoding
- Clean up rpc call decoding to use xdr_streams
- Remove unused variables from various structures
- Refactor code to remove imul instructions
- Rearrange rx_stats structure for better cacheline sharing
Pull HID update from Jiri Kosina:
- Wacom driver fixes/updates (device name generation improvements,
touch ring status support) from Jason Gerecke
- T100 touchpad support from Hans de Goede
- support for batteries driven by HID input reports, from Dmitry
Torokhov
- Arnd pointed out that driver_lock semaphore is superfluous, as driver
core already provides all the necessary concurrency protection.
Removal patch from Binoy Jayan
- logical minimum numbering improvements in sensor-hub driver, from
Srinivas Pandruvada
- support for Microsoft Win8 Wireless Radio Controls extensions from
João Paulo Rechi Vita
- assorted small fixes and device ID additions
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (28 commits)
HID: prodikeys: constify snd_rawmidi_ops structures
HID: sensor: constify platform_device_id
HID: input: throttle battery uevents
HID: usbmouse: constify usb_device_id and fix space before '[' error
HID: usbkbd: constify usb_device_id and fix space before '[' error.
HID: hid-sensor-hub: Force logical minimum to 1 for power and report state
HID: wacom: Do not completely map WACOM_HID_WD_TOUCHRINGSTATUS usage
HID: asus: Add T100CHI bluetooth keyboard dock touchpad support
HID: ntrig: constify attribute_group structures.
HID: logitech-hidpp: constify attribute_group structures.
HID: sensor: constify attribute_group structures.
HID: multitouch: constify attribute_group structures.
HID: multitouch: use proper symbolic constant for 0xff310076 application
HID: multitouch: Support Asus T304UA media keys
HID: multitouch: Support HID_GD_WIRELESS_RADIO_CTLS
HID: input: optionally use device id in battery name
HID: input: map digitizer battery usage
HID: Remove the semaphore driver_lock
HID: wacom: add USB_HID dependency
HID: add ALWAYS_POLL quirk for Logitech 0xc077
...
Pull GPIO updates from Linus Walleij:
"This is the bulk of the GPIO changes for the v4.14 cycle.
Not so much changes this time, phew. David Daney and Bartosz
Golaszewski did all the really interesting work in infrastructure
improvement across GPIO and IRQ core, hats off for them and to tglx
and Marc Z for general help with these patch sets.
Core changes:
- Allow the GPIO irqchip to allocate IRQs dynamically. This is an
important change on systems where only a restricted number of IRQs,
fewer than the number of GPIO lines, can be utilized. Now we can
allocate these on a first-come-first-served basis instead of
hogging up valuable IRQ lines.
- Serious fix-up of the kerneldoc documentation and inclusion into
the kerneldoc builds.
- Pulled in the IRQ simulator from the IRQ core tree and use this in
the GPIO mockup driver for exhaustive testing of interrupt
abilities.
New drivers:
- New driver for ThunderX and OCTEON-TX. This is especially
interesting as it picks up improvements from the IRQ core that
allow us to handle fasteoi ACKs upwards in a hierarchy when there
are IRQ flag latches on several levels in a hierarchy. Very
interesting work here.
- New subdriver for Renesas R-Car r8a7745 (RZ/G1E).
Misc:
- Several fixes and improvements for Xilinx Zynq GPIO.
- Support an enablement GPIO for the 74x164 GPIO.
- Switch a bunch of chips to use devres to allocate irq descriptors.
- A bunch of constification fixes"
* tag 'gpio-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (63 commits)
gpio: mockup: remove unused variable gc
gpio: pl061: constify amba_id
Revert "gpiolib: request the gpio before querying its direction"
gpio: twl6040: remove unneeded forward declaration
gpio: zevio: make gpio_chip const
gpio: add gpio_add_lookup_tables() to add several tables at once
gpio: rcar: Add r8a7745 (RZ/G1E) support
gpio: brcmstb: check return value of gpiochip_irqchip_add()
MAINTAINERS: Add entry for THUNDERX GPIO Driver.
gpio: Add gpio driver support for ThunderX and OCTEON-TX
gpio: mockup: use irq_sim
gpio: mxs: use devres for irq generic chip
gpio: mxc: use devres for irq generic chip
gpio: pch: use devres for irq generic chip
gpio: ml-ioh: use devres for irq generic chip
gpio: sta2x11: use devres for irq generic chip
gpio: sta2x11: disallow unbinding the driver
gpio: mxs: disallow unbinding the driver
gpio: mxc: disallow unbinding the driver
gpio: aspeed: Remove reference to clock name in debounce warning message
...
Pull pin control updates from Linus Walleij:
"This is the big bulk of pin control changes for the v4.14 kernel.
There are just a few bigger changes (new drivers mostly) and then a
lot of small patches all over the place.
Core changes:
- Decision to wrap the sleep mode of the Spreadtrum and in the future
others into a specially tagged state. The generic DT bindings and
the new Spreadtrum driver conforms to this. Others should be moved
over if possible.
New drivers:
- Spreadtrum SoCs especially the SC9860 SoC.
- Storlink/Cortina Gemini 3512 and 3516 SoCs.
New subdrivers:
- Intel Denverton subdriver.
- Intel Cannon Lake subdriver.
- Intel Lewisburg subdriver.
- Allwinner sunxi: R40 subdriver for A10.
- Socionext uniphier PXs3 subdriver.
- Rockchip RK3128 subdriver.
- Renesas SH-PFC R8A77995 subdriver.
Miscellaneous:
- Qualcomm APQ8064 can handle general purpose clock muxing.
- Mediatek MT7623 PCIe mux data fixed up.
- Intel GPIO IRQs are disabled during suspend.
- Several fixes and additions to Renesas r8a7796.
- Qualcomm SPMI GPIO supports dtest route and LV/MV subtype.
- Input schmitt trigger support in Rockchip RV1108.
- Aspeed G4 and G5 USB host/device pin control added.
- Qualcomm IPQ4019 has matured with a few missing pin groups and
control bits put in place.
- Lots of constification; this is the latest in Coccinelle fixes"
* tag 'pinctrl-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (147 commits)
Revert "pinctrl: sunxi: Don't enforce bias disable (for now)"
pinctrl: uniphier: fix members of rmii group for Pro4
pinctrl: Delete an error message
pinctrl: core: Delete an error message
pinctrl: intel: Read back TX buffer state
pinctrl: rockchip: Add rv1108 recalculated iomux support
pinctrl: intel: Decrease indentation in intel_gpio_set()
pinctrl: rza1: Remove suffix from gpiochip label
pinctrl: qcom: spmi-gpio: Correct power_source range check
pinctrl: freescale: make mxs_regs const
pinctrl: aspeed: Rework strap register write logic for the AST2500
pinctrl: rza1: off by one in rza1_parse_gpiochip()
pinctrl: qcom: General Purpose clocks for apq8064
pinctrl: sprd: Add Spreadtrum pin control driver
dt-bindings: pinctrl: Add DT bindings for Spreadtrum SC9860
pinctrl: Add sleep related state to indicate sleep related configs
pinctrl: mediatek: update PCIe mux data for MT7623
pinctrl: intel: Add Intel Lewisburg GPIO support
pinctrl: intel: Add Intel Cannon Lake PCH-H pin controller support
pinctrl: aspeed: Fix ast2500 strap register write logic
...
Pull regulator updates from Mark Brown:
"This is an extremely quiet release for the regulator subsystem, it's
all fairly minor fixes and cleanups plus a few new drivers and device
ID additions:
- Support for MediaTek MT6380, Ricoh RC5T619 and ST Voltage Reference
Buffers"
* tag 'regulator-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (24 commits)
regulator: Add support for stm32-vrefbuf
regulator: Add STM32 Voltage Reference Buffer
regulator: pv88090: Exception handling for out of bounds
regulator: da9063: Return an error code on probe failure
regulator: rn5t618: add RC5T619 PMIC support
regulator: ltc3589: constify i2c_device_id
regulator: fan53555: fix I2C device ids
regulator: add fixes with MT6397 dt-bindings shouldn't reference driver
regulator: add fixes with MT6323 dt-bindings shouldn't reference driver
regulator: add fixes with MT6311 dt-bindings shouldn't reference driver
regulator: Add document for MediaTek MT6380 regulator
regulator: mt6380: Add support for MT6380
regulator: pwm-regulator: Remove unneeded gpiod NULL check
regulator: core: fix a possible race in disable_work handling
regulator: fan53555: Use of_device_get_match_data() to simplify probe
regulator: of: regulator_of_get_init_data() missing of_node_get()
regulator: pwm-regulator: fix example syntax
regulator: Convert to using %pOF instead of full_name
regulator: cpcap: Add OF mode mapping
regulator: cpcap: Fix standby mode
...