No real change in functionality, but the old interface seems to be
deprecated.
We don't actually care about ordering necessarily, but we do depend on
running at most one work item at a time: nfsd4_process_cb_update()
assumes that no other thread is running it, and that no new callbacks
are starting while it's running.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If there is an error reported in mballoc via ext4_grp_locked_error(),
the code is holding a spinlock, so ext4_commit_super() must not try to
lock the buffer head, or else it will trigger a BUG:
BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:358
in_atomic(): 1, irqs_disabled(): 0, pid: 993, name: mount
CPU: 0 PID: 993 Comm: mount Not tainted 4.9.0-rc1-clouder1 #62
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
ffff880006423548 ffffffff81318c89 ffffffff819ecdd0 0000000000000166
ffff880006423558 ffffffff810810b0 ffff880006423580 ffffffff81081153
ffff880006e5a1a0 ffff88000690e400 0000000000000000 ffff8800064235c0
Call Trace:
[<ffffffff81318c89>] dump_stack+0x67/0x9e
[<ffffffff810810b0>] ___might_sleep+0xf0/0x140
[<ffffffff81081153>] __might_sleep+0x53/0xb0
[<ffffffff8126c1dc>] ext4_commit_super+0x19c/0x290
[<ffffffff8126e61a>] __ext4_grp_locked_error+0x14a/0x230
[<ffffffff81081153>] ? __might_sleep+0x53/0xb0
[<ffffffff812822be>] ext4_mb_generate_buddy+0x1de/0x320
Since ext4_grp_locked_error() calls ext4_commit_super with sync == 0
(and it is the only caller which does so), avoid locking and unlocking
the buffer in this case.
This can result in races with ext4_commit_super() if there are other
problems (which is what commit 4743f83990 was trying to address),
but a Warning is better than BUG.
Fixes: 4743f83990
Cc: stable@vger.kernel.org # 4.9
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Return errors to the caller instead of declaring the file system
corrupted.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
This allows us to properly propagate errors back up to
ext4_truncate()'s callers. This also means we no longer have to
silently ignore some errors (e.g., when trying to add the inode to the
orphan inode list).
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page. get_crypt_info() was using a stack buffer to hold the
output from the encryption operation used to derive the per-file key.
Fix it by using a heap buffer.
This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page. For short filenames, fname_encrypt() was encrypting a
stack buffer holding the padded filename. Fix it by encrypting the
filename in-place in the output buffer, thereby making the temporary
buffer unnecessary.
This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Avoid re-use of page index as tweak for AES-XTS when multiple parts of
same page are encrypted. This will happen on multiple (partial) calls of
fscrypt_encrypt_page on same page.
page->index is only valid for writeback pages.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Some filesystems, such as UBIFS, maintain a const pointer for struct
inode.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Not all filesystems work on full pages, thus we should allow them to
hand partial pages to fscrypt for en/decryption.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Some filesystem might pass pages which do not have page->mapping->host
set to the encrypted inode. We want the caller to explicitly pass the
corresponding inode.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4 and f2fs require a bounce page when encrypting pages. However, not
all filesystems will need that (eg. UBIFS). This is handled via a
flag on fscrypt_operations where a fs implementation can select in-place
encryption over using a bounce page (which is the default).
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
jfs uses nanosecond granularity for filesystem timestamps.
Only this assignment is not using nanosecond granularity.
Use current_time() to get the right granularity.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently pstore has a global spinlock for all zones. Since the zones
are independent and modify different areas of memory, there's no need
to have a global lock, so we should use a per-zone lock as introduced
here. Also, when ramoops's ftrace use-case has a FTRACE_PER_CPU flag
introduced later, which splits the ftrace memory area into a single zone
per CPU, it will eliminate the need for locking. In preparation for this,
make the locking optional.
Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: updated commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull VFS fixes from Al Viro:
"Christoph's and Jan's aio fixes, fixup for generic_file_splice_read
(removal of pointless detritus that actually breaks it when used for
gfs2 ->splice_read()) and fixup for generic_file_read_iter()
interaction with ITER_PIPE destinations."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
splice: remove detritus from generic_file_splice_read()
mm/filemap: don't allow partially uptodate page for pipes
aio: fix freeze protection of aio writes
fs: remove aio_run_iocb
fs: remove the never implemented aio_fsync file operation
aio: hold an extra file reference over AIO read/write operations
Pull Ceph fixes from Ilya Dryomov:
"Ceph's ->read_iter() implementation is incompatible with the new
generic_file_splice_read() code that went into -rc1. Switch to the
less efficient default_file_splice_read() for now; the proper fix is
being held for 4.10.
We also have a fix for a 4.8 regression and a trival libceph fixup"
* tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client:
libceph: initialize last_linger_id with a large integer
libceph: fix legacy layout decode with pool 0
ceph: use default file splice read callback
Pull NFS client bugfixes from Anna Schumaker:
"Most of these fix regressions in 4.9, and none are going to stable
this time around.
Bugfixes:
- Trim extra slashes in v4 nfs_paths to fix tools that use this
- Fix a -Wmaybe-uninitialized warnings
- Fix suspicious RCU usages
- Fix Oops when mounting multiple servers at once
- Suppress a false-positive pNFS error
- Fix a DMAR failure in NFS over RDMA"
* tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect
fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use()
NFS: Don't print a pNFS error if we aren't using pNFS
NFS: Ignore connections that have cl_rpcclient uninitialized
SUNRPC: Fix suspicious RCU usage
NFSv4.1: work around -Wmaybe-uninitialized warning
NFS: Trim extra slash in v4 nfs_path
Pull xfs fix from Dave Chinner:
"This is a fix for an unmount hang (regression) when the filesystem is
shutdown. It was supposed to go to you for -rc3, but I accidentally
tagged the commit prior to it in that pullreq.
Summary:
- fix for aborting deferred transactions on filesystem shutdown"
* tag 'xfs-fixes-for-linus-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: defer should abort intent items if the trans roll fails
It could be not possible to freeze coredumping task when it waits for
'core_state->startup' completion, because threads are frozen in
get_signal() before they got a chance to complete 'core_state->startup'.
Inability to freeze a task during suspend will cause suspend to fail.
Also CRIU uses cgroup freezer during dump operation. So with an
unfreezable task the CRIU dump will fail because it waits for a
transition from 'FREEZING' to 'FROZEN' state which will never happen.
Use freezer_do_not_count() to tell freezer to ignore coredumping task
while it waits for core_state->startup completion.
Link: http://lkml.kernel.org/r/1475225434-3753-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
i_size check is a leftover from the horrors that used to play with
the page cache in that function. With the switch to ->read_iter(),
it's neither needed nor correct - for gfs2 it ends up being buggy,
since i_size is not guaranteed to be correct until later (inside
->read_iter()).
Spotted-by: Abhi Das <adas@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Splice read/write implementation changed recently. When using
generic_file_splice_read(), iov_iter with type == ITER_PIPE is
passed to filesystem's read_iter callback. But ceph_sync_read()
can't serve ITER_PIPE iov_iter correctly (ITER_PIPE iov_iter
expects pages from page cache).
Fixing ceph_sync_read() requires a big patch. So use default
splice read callback for now.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce a flag telling iomap operations whether they are handling a
fault or other IO. That may influence behavior wrt inode size and
similar things.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Filesystem shutdown testing on an older distro kernel has uncovered an
imbalanced locking pattern for the inode flush lock in
xfs_reclaim_inode(). Specifically, there is a double unlock sequence
between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
"reclaim:" label.
This actually does not cause obvious problems on current kernels due to
the current flush lock implementation. Older kernels use a counting
based flush lock mechanism, however, which effectively breaks the lock
indefinitely when an already unlocked flush lock is repeatedly unlocked.
Though this only currently occurs on filesystem shutdown, it has
reproduced the effect of elevating an fs shutdown to a system-wide crash
or hang.
As it turns out, the flush lock is not actually required for the reclaim
logic in xfs_reclaim_inode() because by that time we have already cycled
the flush lock once while holding ILOCK_EXCL. Therefore, remove the
additional flush lock/unlock cycle around the 'reclaim:' label and
update branches into this label to release the flush lock where
appropriate. Add an assert to xfs_ifunlock() to help prevent future
occurences of the same problem.
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Pull orangefs fix from Mike Marshall:
"We recently refactored the Orangefs debugfs code. The refactor seemed
to trigger dan.carpenter@oracle.com's static tester to find a possible
double-free in the code.
While designing the fix we saw a condition under which the buffer
being freed could also be overflowed.
We also realized how to rebuild the related debugfs file's "contents"
(a string) without deleting and re-creating the file.
This fix should eliminate the possible double-free, the potential
overflow and improve code readability"
* tag 'for-linus-4.9-rc4-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: clean up debugfs
Check the minimum block size on v5 filesystems.
[dchinner: cleaned up XFS_MIN_CRC_BLOCKSIZE check]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The open-coded pattern:
ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)
is all over the xfs code; provide a new helper
xfs_iext_count(ifp) to count the number of inline extents
in an inode fork.
[dchinner: pick up several missed conversions]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There have been several reports over the years of NULL pointer
dereferences in xfs_trans_log_inode during xfs_fsr processes,
when the process is doing an fput and tearing down extents
on the temporary inode, something like:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
PID: 29439 TASK: ffff880550584fa0 CPU: 6 COMMAND: "xfs_fsr"
[exception RIP: xfs_trans_log_inode+0x10]
#9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
#10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
#11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
#12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
#13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
#14 [ffff8800a57bbe00] evict at ffffffff811e1b67
#15 [ffff8800a57bbe28] iput at ffffffff811e23a5
#16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
#17 [ffff8800a57bbe88] dput at ffffffff811dd06c
#18 [ffff8800a57bbea8] __fput at ffffffff811c823b
#19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
#20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
#21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
#22 [ffff8800a57bbf50] int_signal at ffffffff8161405d
As it turns out, this is because the i_itemp pointer, along
with the d_ops pointer, has been overwritten with zeros
when we tear down the extents during truncate. When the in-core
inode fork on the temporary inode used by xfs_fsr was originally
set up during the extent swap, we mistakenly looked at di_nextents
to determine whether all extents fit inline, but this misses extents
generated by speculative preallocation; we should be using if_bytes
instead.
This mistake corrupts the in-memory inode, and code in
xfs_iext_remove_inline eventually gets bad inputs, causing
it to memmove and memset incorrect ranges; this became apparent
because the two values in ifp->if_u2.if_inline_ext[1] contained
what should have been in d_ops and i_itemp; they were memmoved due
to incorrect array indexing and then the original locations
were zeroed with memset, again due to an array overrun.
Fix this by properly using i_df.if_bytes to determine the number
of extents, not di_nextents.
Thanks to dchinner for looking at this with me and spotting the
root cause.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We've had reports of generic/095 causing XFS to BUG() in
__xfs_get_blocks() due to the existence of delalloc blocks on a
direct I/O read. generic/095 issues a mix of various types of I/O,
including direct and memory mapped I/O to a single file. This is
clearly not supported behavior and is known to lead to such
problems. E.g., the lack of exclusion between the direct I/O and
write fault paths means that a write fault can allocate delalloc
blocks in a region of a file that was previously a hole after the
direct read has attempted to flush/inval the file range, but before
it actually reads the block mapping. In turn, the direct read
discovers a delalloc extent and cannot proceed.
While the appropriate solution here is to not mix direct and memory
mapped I/O to the same regions of the same file, the current
BUG_ON() behavior is probably overkill as it can crash the entire
system. Instead, localize the failure to the I/O in question by
returning an error for a direct I/O that cannot be handled safely
due to delalloc blocks. Be careful to allow the case of a direct
write to post-eof delalloc blocks. This can occur due to speculative
preallocation and is safe as post-eof blocks are not accompanied by
dirty pages in pagecache (conversely, preallocation within eof must
have been zeroed, and thus dirtied, before the inode size could have
been increased beyond said blocks).
Finally, provide an additional warning if a direct I/O write occurs
while the file is memory mapped. This may not catch all problematic
scenarios, but provides a hint that some known-to-be-problematic I/O
methods are in use.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The cowblocks background scanner currently clears the cowblocks tag
for inodes without any real allocations in the cow fork. This
excludes inodes with only delalloc blocks in the cow fork. While we
might never expect to clear delalloc blocks from the cow fork in the
background scanner, it is not necessarily correct to clear the
cowblocks tag from such inodes.
For example, if the background scanner happens to process an inode
between a buffered write and writeback, the scanner catches the
inode in a state after delalloc blocks have been allocated to the
cow fork but before the delalloc blocks have been converted to real
blocks by writeback. The background scanner then incorrectly clears
the cowblocks tag, even if part of the aforementioned delalloc
reservation will not be remapped to the data fork (i.e., extra
blocks due to the cowextsize hint). This means that any such
additional blocks in the cow fork might never be reclaimed by the
background scanner and could persist until the inode itself is
reclaimed.
To address this problem, only skip and clear inodes without any cow
fork allocations whatsoever from the background scanner. While we
generally do not want to cancel delalloc reservations from the
background scanner, the pagecache dirty check following the
cowblocks check should prevent that situation. If we do end up with
delalloc cow fork blocks without a dirty address space mapping, this
is probably an indication that something has gone wrong and the
blocks should be reclaimed, as they may never be converted to a real
allocation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Move the declaration of _dir_ino_validate out of the private
dir2 header file into the public one, since xfsprogs did that
for the benefit of xfs_repair.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Source xfsprogs commit: ee3754254e8c186c99b6cdd4d59f741759d04acb
Kernel commit 5ef828c4 ("xfs: avoid false quotacheck after unclean
shutdown") made xfs_sb_from_disk() also call xfs_sb_quota_from_disk
by default.
However, when this was merged to libxfs, existing separate
calls to libxfs_sb_quota_from_disk remained, and calling it
twice in a row on a V4 superblock leads to issues, because:
if (sbp->sb_qflags & XFS_PQUOTA_ACCT) {
...
sbp->sb_pquotino = sbp->sb_gquotino;
sbp->sb_gquotino = NULLFSINO;
and after the second call, we have set both pquotino and gquotino
to NULLFSINO.
Fix this by making it safe to call twice, and also remove the extra
calls to libxfs_sb_quota_from_disk.
This is only spotted when running xfstests with "-m crc=0" because
the sb_from_disk change came about after V5 became default, and
the above behavior only exists on a V4 superblock.
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Refactor the implementations of xfs_dir2_data_freescan into a
routine that takes the raw directory block parameters and
a second function that figures out the raw parameters from the
directory inode. This enables us to use the exact same code
for both userspace and the kernel, since repair knows exactly
which directory block geometry parameters it needs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Change the xfs_attr_shortform_bytesfit declaration to have
struct xfs_inode to avoid tripping up the libxfs-diff scanner.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The userspace version of _dinode_verify takes a raw inode number
instead of an inode itself. Since neither version actually needs
the inode, port the changes to the kernel. This will also reduce
the libxfs diff noise.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault(). Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking. This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.
There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries. The empty entries exist to provide locking for the duration of a
given page fault.
This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.
Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set. This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees. This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.
One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset. The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases. This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.
If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry. We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>