Commit Graph

15497 Commits

Author SHA1 Message Date
Matthew Wilcox (Oracle)
d4388340ae fs: convert mpage_readpages to mpage_readahead
Implement the new readahead aop and convert all callers (block_dev,
exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
reiserfs & udf).

The callers are all trivial except for GFS2 & OCFS2.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> # ocfs2
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
f2c817bed5 mm: use memalloc_nofs_save in readahead path
Ensure that memory allocations in the readahead path do not attempt to
reclaim file-backed pages, which could lead to a deadlock.  It is
possible, though unlikely this is the root cause of a problem observed
by Cong Wang.

Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-16-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
2d8163e489 mm: document why we don't set PageReadahead
If the page is already in cache, we don't set PageReadahead on it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-15-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
Matthew Wilcox (Oracle)
2c684234d3 mm: add page_cache_readahead_unbounded
ext4 and f2fs have duplicated the guts of the readahead code so they can
read past i_size.  Instead, separate out the guts of the readahead code
so they can call it directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-14-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
b0f31d78cb mm: move end_index check out of readahead loop
By reducing nr_to_read, we can eliminate this check from inside the loop.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-13-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
8151b4c8be mm: add readahead address space operation
This replaces ->readpages with a saner interface:
 - Return void instead of an ignored error code.
 - Page cache is already populated with locked pages when ->readahead
   is called.
 - New arguments can be passed to the implementation without changing
   all the filesystems that use a common helper function like
   mpage_readahead().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-12-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
c1f6925e10 mm: put readahead pages in cache earlier
When populating the page cache for readahead, mappings that use
->readpages must populate the page cache themselves as the pages are
passed on a linked list which would normally be used for the page
cache's LRU.  For mappings that use ->readpage or the upcoming
->readahead method, we can put the pages into the page cache as soon as
they're allocated, which solves a race between readahead and direct IO.
It also lets us remove the gfp argument from read_pages().

Use the new readahead_page() API to implement the repeated calls to
->readpage(), just like most filesystems will.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-11-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
ef8153b609 mm: remove 'page_offset' from readahead loop
Replace the page_offset variable with 'index + i'.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-10-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
c2c7ad74b1 mm: rename readahead loop variable to 'i'
Change the type of page_idx to unsigned long, and rename it -- it's just
a loop counter, not a page index.

Suggested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-9-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
08eb9658ae mm: rename various 'offset' parameters to 'index'
The word 'offset' is used ambiguously to mean 'byte offset within a
page', 'byte offset from the start of the file' and 'page offset from
the start of the file'.

Use 'index' to mean 'page offset from the start of the file' throughout
the readahead code.

[ We should probably rename the 'pgoff_t' type to 'pgidx_t' too - Linus ]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-8-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
a4d9653666 mm: use readahead_control to pass arguments
In this patch, only between __do_page_cache_readahead() and
read_pages(), but it will be extended in upcoming patches.  The
read_pages() function becomes aops centric, as this makes the most sense
by the end of the patchset.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-7-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
ad4ae1c732 mm: move readahead nr_pages check into read_pages
Simplify the callers by moving the check for nr_pages and the BUG_ON
into read_pages().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-5-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
a1ef856652 mm: ignore return value of ->readpages
We used to assign the return value to a variable, which we then ignored.
Remove the pretence of caring.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-4-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
9a42823a10 mm: return void from various readahead functions
ondemand_readahead has two callers, neither of which use the return
value.  That means that both ra_submit and __do_page_cache_readahead()
can return void, and we don't need to worry that a present page in the
readahead window causes us to return a smaller nr_pages than we ought to
have.

Similarly, no caller uses the return value from
force_page_cache_readahead().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-3-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Matthew Wilcox (Oracle)
cee9a0c4e8 mm: move readahead prototypes from mm.h
Patch series "Change readahead API", v11.

This series adds a readahead address_space operation to replace the
readpages operation.  The key difference is that pages are added to the
page cache as they are allocated (and then looked up by the filesystem)
instead of passing them on a list to the readpages operation and having
the filesystem add them to the page cache.  It's a net reduction in code
for each implementation, more efficient than walking a list, and solves
the direct-write vs buffered-read problem reported by yu kuai at
http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com

The only unconverted filesystems are those which use fscache.  Their
conversion is pending Dave Howells' rewrite which will make the
conversion substantially easier.  This should be completed by the end of
the year.

I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric
Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and
Miklos Szeredi have done a marvellous job of providing constructive
criticism.

These patches pass an xfstests run on ext4, xfs & btrfs with no
regressions that I can tell (some of the tests seem a little flaky
before and remain flaky afterwards).

This patch (of 25):

The readahead code is part of the page cache so should be found in the
pagemap.h file.  force_page_cache_readahead is only used within mm, so
move it to mm/internal.h instead.  Remove the parameter names where they
add no value, and rename the ones which were actively misleading.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org
Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Vlastimil Babka
002ae70570 mm, dump_page(): do not crash with invalid mapping pointer
We have seen a following problem on a RPi4 with 1G RAM:

    BUG: Bad page state in process systemd-hwdb  pfn:35601
    page:ffff7e0000d58040 refcount:15 mapcount:131221 mapping:efd8fe765bc80080 index:0x1 compound_mapcount: -32767
    Unable to handle kernel paging request at virtual address efd8fe765bc80080
    Mem abort info:
      ESR = 0x96000004
      Exception class = DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
    Data abort info:
      ISV = 0, ISS = 0x00000004
      CM = 0, WnR = 0
    [efd8fe765bc80080] address between user and kernel address ranges
    Internal error: Oops: 96000004 [#1] SMP
    Modules linked in: btrfs libcrc32c xor xor_neon zlib_deflate raid6_pq mmc_block xhci_pci xhci_hcd usbcore sdhci_iproc sdhci_pltfm sdhci mmc_core clk_raspberrypi gpio_raspberrypi_exp pcie_brcmstb bcm2835_dma gpio_regulator phy_generic fixed sg scsi_mod efivarfs
    Supported: No, Unreleased kernel
    CPU: 3 PID: 408 Comm: systemd-hwdb Not tainted 5.3.18-8-default #1 SLE15-SP2 (unreleased)
    Hardware name: raspberrypi rpi/rpi, BIOS 2020.01 02/21/2020
    pstate: 40000085 (nZcv daIf -PAN -UAO)
    pc : __dump_page+0x268/0x368
    lr : __dump_page+0xc4/0x368
    sp : ffff000012563860
    x29: ffff000012563860 x28: ffff80003ddc4300
    x27: 0000000000000010 x26: 000000000000003f
    x25: ffff7e0000d58040 x24: 000000000000000f
    x23: efd8fe765bc80080 x22: 0000000000020095
    x21: efd8fe765bc80080 x20: ffff000010ede8b0
    x19: ffff7e0000d58040 x18: ffffffffffffffff
    x17: 0000000000000001 x16: 0000000000000007
    x15: ffff000011689708 x14: 3030386362353637
    x13: 6566386466653a67 x12: 6e697070616d2031
    x11: 32323133313a746e x10: 756f6370616d2035
    x9 : ffff00001168a840 x8 : ffff00001077a670
    x7 : 000000000000013d x6 : ffff0000118a43b5
    x5 : 0000000000000001 x4 : ffff80003dd9e2c8
    x3 : ffff80003dd9e2c8 x2 : 911c8d7c2f483500
    x1 : dead000000000100 x0 : efd8fe765bc80080
    Call trace:
     __dump_page+0x268/0x368
     bad_page+0xd4/0x168
     check_new_page_bad+0x80/0xb8
     rmqueue_bulk.constprop.26+0x4d8/0x788
     get_page_from_freelist+0x4d4/0x1228
     __alloc_pages_nodemask+0x134/0xe48
     alloc_pages_vma+0x198/0x1c0
     do_anonymous_page+0x1a4/0x4d8
     __handle_mm_fault+0x4e8/0x560
     handle_mm_fault+0x104/0x1e0
     do_page_fault+0x1e8/0x4c0
     do_translation_fault+0xb0/0xc0
     do_mem_abort+0x50/0xb0
     el0_da+0x24/0x28
    Code: f9401025 8b8018a0 9a851005 17ffffca (f94002a0)

Besides the underlying issue with page->mapping containing a bogus value
for some reason, we can see that __dump_page() crashed by trying to read
the pointer at mapping->host, turning a recoverable warning into full
Oops.

It can be expected that when page is reported as bad state for some
reason, the pointers there should not be trusted blindly.

So this patch treats all data in __dump_page() that depends on
page->mapping as lava, using probe_kernel_read_strict().  Ideally this
would include the dentry->d_parent recursively, but that would mean
changing printk handler for %pd.  Chances of reaching the dentry
printing part with an initially bogus mapping pointer should be rather
low, though.

Also prefix printing mapping->a_ops with a description of what is being
printed.  In case the value is bogus, %ps will print raw value instead
of the symbol name and then it's not obvious at all that it's printing
a_ops.

Reported-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Link: http://lkml.kernel.org/r/20200331165454.12263-1-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Qian Cai
a68ee05739 mm/slub: fix stack overruns with SLUB_STATS
There is no need to copy SLUB_STATS items from root memcg cache to new
memcg cache copies.  Doing so could result in stack overruns because the
store function only accepts 0 to clear the stat and returns an error for
everything else while the show method would print out the whole stat.

Then, the mismatch of the lengths returns from show and store methods
happens in memcg_propagate_slab_attrs():

	else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf))
		buf = mbuf;

max_attr_size is only 2 from slab_attr_store(), then, it uses mbuf[64]
in show_stat() later where a bounch of sprintf() would overrun the stack
variable.  Fix it by always allocating a page of buffer to be used in
show_stat() if SLUB_STATS=y which should only be used for debug purpose.

  # echo 1 > /sys/kernel/slab/fs_cache/shrink
  BUG: KASAN: stack-out-of-bounds in number+0x421/0x6e0
  Write of size 1 at addr ffffc900256cfde0 by task kworker/76:0/53251

  Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
  Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
  Call Trace:
    number+0x421/0x6e0
    vsnprintf+0x451/0x8e0
    sprintf+0x9e/0xd0
    show_stat+0x124/0x1d0
    alloc_slowpath_show+0x13/0x20
    __kmem_cache_create+0x47a/0x6b0

  addr ffffc900256cfde0 is located in stack of task kworker/76:0/53251 at offset 0 in frame:
   process_one_work+0x0/0xb90

  this frame has 1 object:
   [32, 72) 'lockdep_map'

  Memory state around the buggy address:
   ffffc900256cfc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   ffffc900256cfd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  >ffffc900256cfd80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
                                                         ^
   ffffc900256cfe00: 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 00
   ffffc900256cfe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ==================================================================
  Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __kmem_cache_create+0x6ac/0x6b0
  Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
  Call Trace:
    __kmem_cache_create+0x6ac/0x6b0

Fixes: 107dab5c92 ("slub: slub-specific propagation changes")
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Glauber Costa <glauber@scylladb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200429222356.4322-1-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Christopher Lameter
aa456c7aeb slub: remove kmalloc under list_lock from list_slab_objects() V2
list_slab_objects() is called when a slab is destroyed and there are
objects still left to list the objects in the syslog.  This is a pretty
rare event.

And there it seems we take the list_lock and call kmalloc while holding
that lock.

Perform the allocation in free_partial() before the list_lock is taken.

Fixes: bbd7d57bfe ("slub: Potential stack overflow")
Signed-off-by: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yu Zhao <yuzhao@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2002031721250.1668@www.lameter.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Christoph Lameter
d7660ce591 slub: Remove userspace notifier for cache add/remove
I came across some unnecessary uevents once again which reminded me
this.  The patch seems to be lost in the leaves of the original
discussion [1], so resending.

[1] https://lore.kernel.org/r/alpine.DEB.2.21.2001281813130.745@www.lameter.com

Kmem caches are internal kernel structures so it is strange that
userspace notifiers would be needed.  And I am not aware of any use of
these notifiers.  These notifiers may just exist because in the initial
slub release the sysfs code was copied from another subsystem.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Koutný <mkoutny@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200423115721.19821-1-mkoutny@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Dongli Zhang
52f2347808 mm/slub.c: fix corrupted freechain in deactivate_slab()
The slub_debug is able to fix the corrupted slab freelist/page.
However, alloc_debug_processing() only checks the validity of current
and next freepointer during allocation path.  As a result, once some
objects have their freepointers corrupted, deactivate_slab() may lead to
page fault.

Below is from a test kernel module when 'slub_debug=PUF,kmalloc-128
slub_nomerge'.  The test kernel corrupts the freepointer of one free
object on purpose.  Unfortunately, deactivate_slab() does not detect it
when iterating the freechain.

  BUG: unable to handle page fault for address: 00000000123456f8
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  ... ...
  RIP: 0010:deactivate_slab.isra.92+0xed/0x490
  ... ...
  Call Trace:
   ___slab_alloc+0x536/0x570
   __slab_alloc+0x17/0x30
   __kmalloc+0x1d9/0x200
   ext4_htree_store_dirent+0x30/0xf0
   htree_dirblock_to_tree+0xcb/0x1c0
   ext4_htree_fill_tree+0x1bc/0x2d0
   ext4_readdir+0x54f/0x920
   iterate_dir+0x88/0x190
   __x64_sys_getdents+0xa6/0x140
   do_syscall_64+0x49/0x170
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Therefore, this patch adds extra consistency check in deactivate_slab().
Once an object's freepointer is corrupted, all following objects
starting at this object are isolated.

[akpm@linux-foundation.org: fix build with CONFIG_SLAB_DEBUG=n]
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200331031450.12182-1-dongli.zhang@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Vlastimil Babka
49f2d2419d usercopy: mark dma-kmalloc caches as usercopy caches
We have seen a "usercopy: Kernel memory overwrite attempt detected to
SLUB object 'dma-kmalloc-1 k' (offset 0, size 11)!" error on s390x, as
IUCV uses kmalloc() with __GFP_DMA because of memory address
restrictions.  The issue has been discussed [2] and it has been noted
that if all the kmalloc caches are marked as usercopy, there's little
reason not to mark dma-kmalloc caches too.  The 'dma' part merely means
that __GFP_DMA is used to restrict memory address range.

As Jann Horn put it [3]:
 "I think dma-kmalloc slabs should be handled the same way as normal
  kmalloc slabs. When a dma-kmalloc allocation is freshly created, it is
  just normal kernel memory - even if it might later be used for DMA -,
  and it should be perfectly fine to copy_from_user() into such
  allocations at that point, and to copy_to_user() out of them at the
  end. If you look at the places where such allocations are created, you
  can see things like kmemdup(), memcpy() and so on - all normal
  operations that shouldn't conceptually be different from usercopy in
  any relevant way."

Thus this patch marks the dma-kmalloc-* caches as usercopy.

[1] https://bugzilla.suse.com/show_bug.cgi?id=1156053
[2] https://lore.kernel.org/kernel-hardening/bfca96db-bbd0-d958-7732-76e36c667c68@suse.cz/
[3] https://lore.kernel.org/kernel-hardening/CAG48ez1a4waGk9kB0WLaSbs4muSoK0AYAVk8=XYaKj4_+6e6Hg@mail.gmail.com/

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Jiri Slaby <jslaby@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Julian Wiedmann <jwi@linux.ibm.com>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Windsor <dave@nullcore.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Luis de Bethencourt <luisbg@kernel.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Matthew Garrett <mjg59@google.com>
Cc: Michal Kubecek <mkubecek@suse.cz>
Link: http://lkml.kernel.org/r/7d810f6d-8085-ea2f-7805-47ba3842dc50@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:06 -07:00
Linus Torvalds
17839856fd gup: document and work around "COW can break either way" issue
Doing a "get_user_pages()" on a copy-on-write page for reading can be
ambiguous: the page can be COW'ed at any time afterwards, and the
direction of a COW event isn't defined.

Yes, whoever writes to it will generally do the COW, but if the thread
that did the get_user_pages() unmapped the page before the write (and
that could happen due to memory pressure in addition to any outright
action), the writer could also just take over the old page instead.

End result: the get_user_pages() call might result in a page pointer
that is no longer associated with the original VM, and is associated
with - and controlled by - another VM having taken it over instead.

So when doing a get_user_pages() on a COW mapping, the only really safe
thing to do would be to break the COW when getting the page, even when
only getting it for reading.

At the same time, some users simply don't even care.

For example, the perf code wants to look up the page not because it
cares about the page, but because the code simply wants to look up the
physical address of the access for informational purposes, and doesn't
really care about races when a page might be unmapped and remapped
elsewhere.

This adds logic to force a COW event by setting FOLL_WRITE on any
copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
pointer as a result.

The current semantics end up being:

 - __get_user_pages_fast(): no change. If you don't ask for a write,
   you won't break COW. You'd better know what you're doing.

 - get_user_pages_fast(): the fast-case "look it up in the page tables
   without anything getting mmap_sem" now refuses to follow a read-only
   page, since it might need COW breaking.  Which happens in the slow
   path - the fast path doesn't know if the memory might be COW or not.

 - get_user_pages() (including the slow-path fallback for gup_fast()):
   for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
   very similar semantics to FOLL_FORCE.

If it turns out that we want finer granularity (ie "only break COW when
it might actually matter" - things like the zero page are special and
don't need to be broken) we might need to push these semantics deeper
into the lookup fault path.  So if people care enough, it's possible
that we might end up adding a new internal FOLL_BREAK_COW flag to go
with the internal FOLL_COW flag we already have for tracking "I had a
COW".

Alternatively, if it turns out that different callers might want to
explicitly control the forced COW break behavior, we might even want to
make such a flag visible to the users of get_user_pages() instead of
using the above default semantics.

But for now, this is mostly commentary on the issue (this commit message
being a lot bigger than the patch, and that patch in turn is almost all
comments), with that minimal "enable COW breaking early" logic using the
existing FOLL_WRITE behavior.

[ It might be worth noting that we've always had this ambiguity, and it
  could arguably be seen as a user-space issue.

  You only get private COW mappings that could break either way in
  situations where user space is doing cooperative things (ie fork()
  before an execve() etc), but it _is_ surprising and very subtle, and
  fork() is supposed to give you independent address spaces.

  So let's treat this as a kernel issue and make the semantics of
  get_user_pages() easier to understand. Note that obviously a true
  shared mapping will still get a page that can change under us, so this
  does _not_ mean that get_user_pages() somehow returns any "stable"
  page ]

Reported-by: Jann Horn <jannh@google.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill Shutemov <kirill@shutemov.name>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:19:17 -07:00
Linus Torvalds
b23c4771ff Merge tag 'docs-5.8' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
 "A fair amount of stuff this time around, dominated by yet another
  massive set from Mauro toward the completion of the RST conversion. I
  *really* hope we are getting close to the end of this. Meanwhile,
  those patches reach pretty far afield to update document references
  around the tree; there should be no actual code changes there. There
  will be, alas, more of the usual trivial merge conflicts.

  Beyond that we have more translations, improvements to the sphinx
  scripting, a number of additions to the sysctl documentation, and lots
  of fixes"

* tag 'docs-5.8' of git://git.lwn.net/linux: (130 commits)
  Documentation: fixes to the maintainer-entry-profile template
  zswap: docs/vm: Fix typo accept_threshold_percent in zswap.rst
  tracing: Fix events.rst section numbering
  docs: acpi: fix old http link and improve document format
  docs: filesystems: add info about efivars content
  Documentation: LSM: Correct the basic LSM description
  mailmap: change email for Ricardo Ribalda
  docs: sysctl/kernel: document unaligned controls
  Documentation: admin-guide: update bug-hunting.rst
  docs: sysctl/kernel: document ngroups_max
  nvdimm: fixes to maintainter-entry-profile
  Documentation/features: Correct RISC-V kprobes support entry
  Documentation/features: Refresh the arch support status files
  Revert "docs: sysctl/kernel: document ngroups_max"
  docs: move locking-specific documents to locking/
  docs: move digsig docs to the security book
  docs: move the kref doc into the core-api book
  docs: add IRQ documentation at the core-api book
  docs: debugging-via-ohci1394.txt: add it to the core-api book
  docs: fix references for ipmi.rst file
  ...
2020-06-01 15:45:27 -07:00
Linus Torvalds
533b220f7b Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
 "A sizeable pile of arm64 updates for 5.8.

  Summary below, but the big two features are support for Branch Target
  Identification and Clang's Shadow Call stack. The latter is currently
  arm64-only, but the high-level parts are all in core code so it could
  easily be adopted by other architectures pending toolchain support

  Branch Target Identification (BTI):

   - Support for ARMv8.5-BTI in both user- and kernel-space. This allows
     branch targets to limit the types of branch from which they can be
     called and additionally prevents branching to arbitrary code,
     although kernel support requires a very recent toolchain.

   - Function annotation via SYM_FUNC_START() so that assembly functions
     are wrapped with the relevant "landing pad" instructions.

   - BPF and vDSO updates to use the new instructions.

   - Addition of a new HWCAP and exposure of BTI capability to userspace
     via ID register emulation, along with ELF loader support for the
     BTI feature in .note.gnu.property.

   - Non-critical fixes to CFI unwind annotations in the sigreturn
     trampoline.

  Shadow Call Stack (SCS):

   - Support for Clang's Shadow Call Stack feature, which reserves
     platform register x18 to point at a separate stack for each task
     that holds only return addresses. This protects function return
     control flow from buffer overruns on the main stack.

   - Save/restore of x18 across problematic boundaries (user-mode,
     hypervisor, EFI, suspend, etc).

   - Core support for SCS, should other architectures want to use it
     too.

   - SCS overflow checking on context-switch as part of the existing
     stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.

  CPU feature detection:

   - Removed numerous "SANITY CHECK" errors when running on a system
     with mismatched AArch32 support at EL1. This is primarily a concern
     for KVM, which disabled support for 32-bit guests on such a system.

   - Addition of new ID registers and fields as the architecture has
     been extended.

  Perf and PMU drivers:

   - Minor fixes and cleanups to system PMU drivers.

  Hardware errata:

   - Unify KVM workarounds for VHE and nVHE configurations.

   - Sort vendor errata entries in Kconfig.

  Secure Monitor Call Calling Convention (SMCCC):

   - Update to the latest specification from Arm (v1.2).

   - Allow PSCI code to query the SMCCC version.

  Software Delegated Exception Interface (SDEI):

   - Unexport a bunch of unused symbols.

   - Minor fixes to handling of firmware data.

  Pointer authentication:

   - Add support for dumping the kernel PAC mask in vmcoreinfo so that
     the stack can be unwound by tools such as kdump.

   - Simplification of key initialisation during CPU bringup.

  BPF backend:

   - Improve immediate generation for logical and add/sub instructions.

  vDSO:

   - Minor fixes to the linker flags for consistency with other
     architectures and support for LLVM's unwinder.

   - Clean up logic to initialise and map the vDSO into userspace.

  ACPI:

   - Work around for an ambiguity in the IORT specification relating to
     the "num_ids" field.

   - Support _DMA method for all named components rather than only PCIe
     root complexes.

   - Minor other IORT-related fixes.

  Miscellaneous:

   - Initialise debug traps early for KGDB and fix KDB cacheflushing
     deadlock.

   - Minor tweaks to early boot state (documentation update, set
     TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).

   - Refactoring and cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
  KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
  KVM: arm64: Check advertised Stage-2 page size capability
  arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
  ACPI/IORT: Remove the unused __get_pci_rid()
  arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
  arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
  arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
  arm64/cpufeature: Introduce ID_MMFR5 CPU register
  arm64/cpufeature: Introduce ID_DFR1 CPU register
  arm64/cpufeature: Introduce ID_PFR2 CPU register
  arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
  arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
  arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
  arm64: mm: Add asid_gen_match() helper
  firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
  arm64: vdso: Fix CFI directives in sigreturn trampoline
  arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
  ...
2020-06-01 15:18:27 -07:00
Linus Torvalds
60056060be Merge tag 'locking-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The biggest change to core locking facilities in this cycle is the
  introduction of local_lock_t - this primitive comes from the -rt
  project and identifies CPU-local locking dependencies normally handled
  opaquely beind preempt_disable() or local_irq_save/disable() critical
  sections.

  The generated code on mainline kernels doesn't change as a result, but
  still there are benefits: improved debugging and better documentation
  of data structure accesses.

  The new local_lock_t primitives are introduced and then utilized in a
  couple of kernel subsystems. No change in functionality is intended.

  There's also other smaller changes and cleanups"

* tag 'locking-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  zram: Use local lock to protect per-CPU data
  zram: Allocate struct zcomp_strm as per-CPU memory
  connector/cn_proc: Protect send_msg() with a local lock
  squashfs: Make use of local lock in multi_cpu decompressor
  mm/swap: Use local_lock for protection
  radix-tree: Use local_lock for protection
  locking: Introduce local_lock()
  locking/lockdep: Replace zero-length array with flexible-array
  locking/rtmutex: Remove unused rt_mutex_cmpxchg_relaxed()
2020-06-01 13:03:31 -07:00
Rafael J. Wysocki
48ccdeddc5 Merge branches 'acpi-apei', 'acpi-pmic', 'acpi-video' and 'acpi-dptf'
* acpi-apei:
  arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work
  ACPI: APEI: Kick the memory_failure() queue for synchronous errors
  mm/memory-failure: Add memory_failure_queue_kick()

* acpi-pmic:
  ACPI / PMIC: Add i2c address for thermal control

* acpi-video:
  ACPI: video: Use native backlight on Acer TravelMate 5735Z

* acpi-dptf:
  ACPI: DPTF: Add battery participant driver
  ACPI: DPTF: Additional sysfs attributes for power participant driver
2020-06-01 17:19:43 +02:00
David S. Miller
1806c13dc2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.

The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-31 17:48:46 -07:00
Hugh Dickins
2f33a70602 mm,thp: stop leaking unreleased file pages
When collapse_file() calls try_to_release_page(), it has already isolated
the page: so if releasing buffers happens to fail (as it sometimes does),
remember to putback_lru_page(): otherwise that page is left unreclaimable
and unfreeable, and the file extent uncollapsible.

Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org>	[5.4+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005231837500.1766@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-28 11:35:40 -07:00
Qian Cai
af4798a5bb mm/z3fold: silence kmemleak false positives of slots
Kmemleak reported many leaks while under memory pressue in,

    slots = alloc_slots(pool, gfp);

which is referenced by "zhdr" in init_z3fold_page(),

    zhdr->slots = slots;

However, "zhdr" could be gone without freeing slots as the later will be
freed separately when the last "handle" off of "handles" array is freed.
It will be within "slots" which is always aligned.

  unreferenced object 0xc000000fdadc1040 (size 104):
  comm "oom04", pid 140476, jiffies 4295359280 (age 3454.970s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    z3fold_zpool_malloc+0x7b0/0xe10
    alloc_slots at mm/z3fold.c:214
    (inlined by) init_z3fold_page at mm/z3fold.c:412
    (inlined by) z3fold_alloc at mm/z3fold.c:1161
    (inlined by) z3fold_zpool_malloc at mm/z3fold.c:1735
    zpool_malloc+0x34/0x50
    zswap_frontswap_store+0x60c/0xda0
    zswap_frontswap_store at mm/zswap.c:1093
    __frontswap_store+0x128/0x330
    swap_writepage+0x58/0x110
    pageout+0x16c/0xa40
    shrink_page_list+0x1ac8/0x25c0
    shrink_inactive_list+0x270/0x730
    shrink_lruvec+0x444/0xf30
    shrink_node+0x2a4/0x9c0
    do_try_to_free_pages+0x158/0x640
    try_to_free_pages+0x1bc/0x5f0
    __alloc_pages_slowpath.constprop.60+0x4dc/0x15a0
    __alloc_pages_nodemask+0x520/0x650
    alloc_pages_vma+0xc0/0x420
    handle_mm_fault+0x1174/0x1bf0

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vitaly Wool <vitaly.wool@konsulko.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: http://lkml.kernel.org/r/20200522220052.2225-1-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-28 11:35:40 -07:00
Ingo Molnar
b01b214199 mm/swap: Use local_lock for protection
The various struct pagevec per CPU variables are protected by disabling
either preemption or interrupts across the critical sections. Inside
these sections spinlocks have to be acquired.

These spinlocks are regular spinlock_t types which are converted to
"sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
locks cannot be acquired in preemption or interrupt disabled sections.

local locks provide a trivial way to substitute preempt and interrupt
disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps
to preempt_disable() and local_lock_irq() to local_irq_disable().

Create lru_rotate_pvecs containing the pagevec and the locallock.
Create lru_pvecs containing the remaining pagevecs and the locallock.
Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
exporting the pvec structure.

Change the relevant call sites to acquire these locks instead of using
preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
local_irq_save().

There is neither a functional change nor a change in the generated
binary code for non PREEMPT_RT enabled non-debug kernels.

When lockdep is enabled local locks have lockdep maps embedded. These
allow lockdep to validate the protections, i.e. inappropriate usage of a
preemption only protected sections would result in a lockdep warning
while the same problem would not be noticed with a plain
preempt_disable() based protection.

local locks also improve readability as they provide a named scope for
the protections while preempt/interrupt disable are opaque scopeless.

Finally local locks allow PREEMPT_RT to substitute them with real
locking primitives to ensure the correctness of operation in a fully
preemptible kernel.

[ bigeasy: Adopted to use local_lock ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200527201119.1692513-4-bigeasy@linutronix.de
2020-05-28 10:31:10 +02:00
Bibo Mao
44bf431b47 mm/memory.c: Add memory read privilege on page fault handling
Here add pte_sw_mkyoung function to make page readable on MIPS
platform during page fault handling. This patch improves page
fault latency about 10% on my MIPS machine with lmbench
lat_pagefault case.

It is noop function on other arches, there is no negative
influence on those architectures.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-05-27 13:06:40 +02:00
Bibo Mao
7df6769743 mm/memory.c: Update local TLB if PTE entry exists
If two threads concurrently fault at the same page, the thread that
won the race updates the PTE and its local TLB. For now, the other
thread gives up, simply does nothing, and continues.

It could happen that this second thread triggers another fault, whereby
it only updates its local TLB while handling the fault. Instead of
triggering another fault, let's directly update the local TLB of the
second thread. Function update_mmu_tlb is used here to update local
TLB on the second thread, and it is defined as empty on other arches.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-05-27 13:06:09 +02:00
Goldwyn Rodrigues
d85dc2e116 fs: export generic_file_buffered_read()
Export generic_file_buffered_read() to be used to supplement incomplete
direct reads.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 13:12:37 +02:00
David S. Miller
13209a8f73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24 13:47:27 -07:00
Uladzislau Rezki
d8f117abb3 z3fold: fix use-after-free when freeing handles
free_handle() for a foreign handle may race with inter-page compaction,
what can lead to memory corruption.

To avoid that, take write lock not read lock in free_handle to be
synchronized with __release_z3fold_page().

For example KASAN can detect it:

  ==================================================================
  BUG: KASAN: use-after-free in LZ4_decompress_safe+0x2c4/0x3b8
  Read of size 1 at addr ffffffc976695ca3 by task GoogleApiHandle/4121

  CPU: 0 PID: 4121 Comm: GoogleApiHandle Tainted: P S         OE     4.19.81-perf+ #162
  Hardware name: Sony Mobile Communications. PDX-203(KONA) (DT)
  Call trace:
     LZ4_decompress_safe+0x2c4/0x3b8
     lz4_decompress_crypto+0x3c/0x70
     crypto_decompress+0x58/0x70
     zcomp_decompress+0xd4/0x120
     ...

Apart from that, initialize zhdr->mapped_count in init_z3fold_page() and
remove "newpage" variable because it is not used anywhere.

Signed-off-by: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Raymond Jennings <shentino@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200520082100.28876-1-vitaly.wool@konsulko.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-23 10:26:32 -07:00
Marco Elver
33cd65e73a kasan: disable branch tracing for core runtime
During early boot, while KASAN is not yet initialized, it is possible to
enter reporting code-path and end up in kasan_report().

While uninitialized, the branch there prevents generating any reports,
however, under certain circumstances when branches are being traced
(TRACE_BRANCH_PROFILING), we may recurse deep enough to cause kernel
reboots without warning.

To prevent similar issues in future, we should disable branch tracing
for the core runtime.

[elver@google.com: remove duplicate DISABLE_BRANCH_PROFILING, per Qian Cai]
  Link: https://lore.kernel.org/lkml/20200517011732.GE24705@shao2-debian/
  Link: http://lkml.kernel.org/r/20200522075207.157349-1-elver@google.com
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r//20200517011732.GE24705@shao2-debian/
Link: http://lkml.kernel.org/r/20200519182459.87166-1-elver@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-23 10:26:31 -07:00
James Morse
062022315e mm/memory-failure: Add memory_failure_queue_kick()
The GHES code calls memory_failure_queue() from IRQ context to schedule
work on the current CPU so that memory_failure() can sleep.

For synchronous memory errors the arch code needs to know any signals
that memory_failure() will trigger are pending before it returns to
user-space, possibly when exiting from the IRQ.

Add a helper to kick the memory failure queue, to ensure the scheduled
work has happened. This has to be called from process context, so may
have been migrated from the original cpu. Pass the cpu the work was
queued on.

Change memory_failure_work_func() to permit being called on the 'wrong'
cpu.

Signed-off-by: James Morse <james.morse@arm.com>
Tested-by: Tyler Baicar <baicar@os.amperecomputing.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-19 19:51:10 +02:00
David S. Miller
da07f52d3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Move the bpf verifier trace check into the new switch statement in
HEAD.

Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 13:48:59 -07:00
Sami Tolvanen
628d06a48f scs: Add page accounting for shadow call stack allocations
This change adds accounting for the memory allocated for shadow stacks.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-15 16:35:49 +01:00
Andrey Konovalov
13cf048802 kasan: add missing functions declarations to kasan.h
KASAN is currently missing declarations for __asan_report* and __hwasan*
functions.  This can lead to compiler warnings.

Reported-by: Leon Romanovsky <leon@kernel.org>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Leon Romanovsky <leon@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/r/45b445a76a79208918f0cc44bfabebaea909b54d.1589297433.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-14 10:00:35 -07:00
Andrey Konovalov
8a16c09edc kasan: consistently disable debugging features
KASAN is incompatible with some kernel debugging/tracing features.

There's been multiple patches that disable those feature for some of
KASAN files one by one.  Instead of prolonging that, disable these
features for all KASAN files at once.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Link: http://lkml.kernel.org/r/29bd753d5ff5596425905b0b07f51153e2345cc1.1589297433.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-14 10:00:35 -07:00
Brian Geffon
d156492606 userfaultfd: fix remap event with MREMAP_DONTUNMAP
A user is not required to set a new address when using MREMAP_DONTUNMAP
as it can be used without MREMAP_FIXED.  When doing so the remap event
will use new_addr which may not have been set and we didn't propagate it
back other then in the return value of remap_to.

Because ret is always the new address it's probably more correct to use
it rather than new_addr on the remap_event_complete call, and it
resolves this bug.

Fixes: e346b38130 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Link: http://lkml.kernel.org/r/20200506172158.218366-1-bgeffon@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-14 10:00:35 -07:00
Peter Xu
475f4dfc02 mm/gup: fix fixup_user_fault() on multiple retries
This part was overlooked when reworking the gup code on multiple
retries.

When we get the 2nd+ retry, we'll be with TRIED flag set.  Current code
will bail out on the 2nd retry because the !TRIED check will fail so the
retry logic will be skipped.  What's worse is that, it will also return
zero which errornously hints the caller that the page is faulted in
while it's not.

The !TRIED flag check seems to not be needed even before the mutliple
retries change because if we get a VM_FAULT_RETRY, it must be the 1st
retry, and we should not have TRIED set for that.

Fix it by removing the !TRIED check, at the meantime check against fatal
signals properly before the page fault so we can still properly respond
to the user killing the process during retries.

Fixes: 4426e945df ("mm/gup: allow VM_FAULT_RETRY for multiple times")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Link: http://lkml.kernel.org/r/20200502003523.8204-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-14 10:00:35 -07:00
Jason Gunthorpe
2733ea144d mm/hmm: remove the customizable pfn format from hmm_range_fault
Presumably the intent here was that hmm_range_fault() could put the data
into some HW specific format and thus avoid some work. However, nothing
actually does that, and it isn't clear how anything actually could do that
as hmm_range_fault() provides CPU addresses which must be DMA mapped.

Perhaps there is some special HW that does not need DMA mapping, but we
don't have any examples of this, and the theoretical performance win of
avoiding an extra scan over the pfns array doesn't seem worth the
complexity. Plus pfns needs to be scanned anyhow to sort out any
DEVICE_PRIVATE pages.

This version replaces the uint64_t with an usigned long containing a pfn
and fixed flags. On input flags is filled with the HMM_PFN_REQ_* values,
on successful output it is filled with HMM_PFN_* values, describing the
state of the pages.

amdgpu is simple to convert, it doesn't use snapshot and doesn't use
per-page flags.

nouveau uses only 16 hmm_pte entries at most (ie fits in a few cache
lines), and it sweeps over its pfns array a couple of times anyhow. It
also has a nasty call chain before it reaches the dma map and hardware
suggesting performance isn't important:

   nouveau_svm_fault():
     args.i.m.method = NVIF_VMM_V0_PFNMAP
     nouveau_range_fault()
      nvif_object_ioctl()
       client->driver->ioctl()
	  struct nvif_driver nvif_driver_nvkm:
	    .ioctl = nvkm_client_ioctl
	   nvkm_ioctl()
	    nvkm_ioctl_path()
	      nvkm_ioctl_v0[type].func(..)
	      nvkm_ioctl_mthd()
	       nvkm_object_mthd()
		  struct nvkm_object_func nvkm_uvmm:
		    .mthd = nvkm_uvmm_mthd
		   nvkm_uvmm_mthd()
		    nvkm_uvmm_mthd_pfnmap()
		     nvkm_vmm_pfn_map()
		      nvkm_vmm_ptes_get_map()
		       func == gp100_vmm_pgt_pfn
			struct nvkm_vmm_desc_func gp100_vmm_desc_spt:
			  .pfn = gp100_vmm_pgt_pfn
			 nvkm_vmm_iter()
			  REF_PTES == func == gp100_vmm_pgt_pfn()
			    dma_map_page()

Link: https://lore.kernel.org/r/5-v2-b4e84f444c7d+24f57-hmm_no_flags_jgg@mellanox.com
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-11 10:47:29 -03:00
Jason Gunthorpe
5c8f3c4cf1 mm/hmm: remove HMM_PFN_SPECIAL
This is just an alias for HMM_PFN_ERROR, nothing cares that the error was
because of a special page vs any other error case.

Link: https://lore.kernel.org/r/4-v2-b4e84f444c7d+24f57-hmm_no_flags_jgg@mellanox.com
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-11 10:47:29 -03:00
Jason Gunthorpe
be957c886d mm/hmm: make hmm_range_fault return 0 or -1
hmm_vma_walk->last is supposed to be updated after every write to the
pfns, so that it can be returned by hmm_range_fault(). However, this is
not done consistently. Fortunately nothing checks the return code of
hmm_range_fault() for anything other than error.

More importantly last must be set before returning -EBUSY as it is used to
prevent reading an output pfn as an input flags when the loop restarts.

For clarity and simplicity make hmm_range_fault() return 0 or -ERRNO. Only
set last when returning -EBUSY.

Link: https://lore.kernel.org/r/2-v2-b4e84f444c7d+24f57-hmm_no_flags_jgg@mellanox.com
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-11 10:47:29 -03:00
Linus Torvalds
0a85ed6e7f Merge tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - a small series fixing a use-after-free of bdi name (Christoph,Yufen)

 - NVMe fix for a regression with the smaller CQ update (Alexey)

 - NVMe fix for a hang at namespace scanning error recovery (Sagi)

 - fix race with blk-iocost iocg->abs_vdebt updates (Tejun)

* tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-block:
  nvme: fix possible hang when ns scanning fails during error recovery
  nvme-pci: fix "slimmer CQ head update"
  bdi: add a ->dev_name field to struct backing_dev_info
  bdi: use bdi_dev_name() to get device name
  bdi: move bdi_dev_name out of line
  vboxsf: don't use the source name in the bdi name
  iocost: protect iocg->abs_vdebt with iocg->waitq.lock
2020-05-10 11:16:07 -07:00
Christoph Hellwig
1cd925d583 bdi: remove the name field in struct backing_dev_info
The name is only printed for a not registered bdi in writeback.  Use the
device name there as is more useful anyway for the unlike case that the
warning triggers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig
aef33c2ff8 bdi: simplify bdi_alloc
Merge the _node vs normal version and drop the superflous gfp_t argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig
3c5d202b55 bdi: remove bdi_register_owner
Split out a new bdi_set_owner helper to set the owner, and move the policy
for creating the bdi name back into genhd.c, where it belongs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00