Pull read/write fix from Al Viro:
"file_start_write()/file_end_write() got mixed into vfs_iter_write() by
accident; that's a deadlock for all existing callers - they already do
that, some - quite a bit outside.
Easily fixed, fortunately"
* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
move file_{start,end}_write() out of do_iter_write()
To avoid pathological stack usage or the need to special-case setuid
execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Writeback error handling updates from Jeff Layton:
"This pile represents the bulk of the writeback error handling fixes
that I have for this cycle. Some of the earlier patches in this pile
may look trivial but they are prerequisites for later patches in the
series.
The aim of this set is to improve how we track and report writeback
errors to userland. Most applications that care about data integrity
will periodically call fsync/fdatasync/msync to ensure that their
writes have made it to the backing store.
For a very long time, we have tracked writeback errors using two flags
in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
writeback error occurs (via mapping_set_error) and are cleared as a
side-effect of filemap_check_errors (as you noted yesterday). This
model really sucks for userland.
Only the first task to call fsync (or msync or fdatasync) will see the
error. Any subsequent task calling fsync on a file will get back 0
(unless another writeback error occurs in the interim). If I have
several tasks writing to a file and calling fsync to ensure that their
writes got stored, then I need to have them coordinate with one
another. That's difficult enough, but in a world of containerized
setups that coordination may even not be possible.
But wait...it gets worse!
The calls to filemap_check_errors can be buried pretty far down in the
call stack, and there are internal callers of filemap_write_and_wait
and the like that also end up clearing those errors. Many of those
callers ignore the error return from that function or return it to
userland at nonsensical times (e.g. truncate() or stat()). If I get
back -EIO on a truncate, there is no reason to think that it was
because some previous writeback failed, and a subsequent fsync() will
(incorrectly) return 0.
This pile aims to do three things:
1) ensure that when a writeback error occurs that that error will be
reported to userland on a subsequent fsync/fdatasync/msync call,
regardless of what internal callers are doing
2) report writeback errors on all file descriptions that were open at
the time that the error occurred. This is a user-visible change,
but I think most applications are written to assume this behavior
anyway. Those that aren't are unlikely to be hurt by it.
3) document what filesystems should do when there is a writeback
error. Today, there is very little consistency between them, and a
lot of cargo-cult copying. We need to make it very clear what
filesystems should do in this situation.
To achieve this, the set adds a new data type (errseq_t) and then
builds new writeback error tracking infrastructure around that. Once
all of that is in place, we change the filesystems to use the new
infrastructure for reporting wb errors to userland.
Note that this is just the initial foray into cleaning up this mess.
There is a lot of work remaining here:
1) convert the rest of the filesystems in a similar fashion. Once the
initial set is in, then I think most other fs' will be fairly
simple to convert. Hopefully most of those can in via individual
filesystem trees.
2) convert internal waiters on writeback to use errseq_t for
detecting errors instead of relying on the AS_* flags. I have some
draft patches for this for ext4, but they are not quite ready for
prime time yet.
This was a discussion topic this year at LSF/MM too. If you're
interested in the gory details, LWN has some good articles about this:
https://lwn.net/Articles/718734/https://lwn.net/Articles/724307/"
* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
btrfs: minimal conversion to errseq_t writeback error reporting on fsync
xfs: minimal conversion to errseq_t writeback error reporting
ext4: use errseq_t based error handling for reporting data writeback errors
fs: convert __generic_file_fsync to use errseq_t based reporting
block: convert to errseq_t based writeback error tracking
dax: set errors in mapping when writeback fails
Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
fs: new infrastructure for writeback error handling and reporting
lib: add errseq_t type and infrastructure for handling it
mm: don't TestClearPageError in __filemap_fdatawait_range
mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
jbd2: don't clear and reset errors after waiting on writeback
buffer: set errors in mapping at the time that the error occurs
fs: check for writeback errors after syncing out buffers in generic_file_fsync
buffer: use mapping_set_error instead of setting the flag
mm: fix mapping_set_error call in me_pagecache_dirty
In quite a few places we call xfs_da_read_buf with a mappedbno that we
don't control, then assume that the function passes back either an error
code or a buffer pointer. Unfortunately, if mappedbno == -2 and bno
maps to a hole, we get a return code of zero and a NULL buffer, which
means that we crash if we actually try to use that buffer pointer. This
happens immediately when we set the buffer type for transaction context.
Therefore, check that we have no error code and a non-NULL bp before
trying to use bp. This patch is a follow-up to an incomplete fix in
96a3aefb8f ("xfs: don't crash if reading a directory results in an
unexpected hole").
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull Writeback error handling fixes from Jeff Layton:
"The main rationale for all of these changes is to tighten up writeback
error reporting to userland. There are many ways now that writeback
errors can be lost, such that fsync/fdatasync/msync return 0 when
writeback actually failed.
This pile contains a small set of cleanups and writeback error
handling fixes that I was able to break off from the main pile (#2).
Two of the patches in this pile are trivial. The exceptions are the
patch to fix up error handling in write_one_page, and the patch to
make JFS pay attention to write_one_page errors"
* tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: remove call_fsync helper function
mm: clean up error handling in write_one_page
JFS: do not ignore return code from write_one_page()
mm: drop "wait" parameter from write_one_page()
Pull cifs fixes from Steve French:
"First set of CIFS/SMB3 fixes for the merge window. Also improves POSIX
character mapping for SMB3"
* tag 'cifs-bug-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: fix circular locking dependency
cifs: set oparms.create_options rather than or'ing in CREATE_OPEN_BACKUP_INTENT
cifs: Do not modify mid entry after submitting I/O in cifs_call_async
CIFS: add SFM mapping for 0x01-0x1F
cifs: hide unused functions
cifs: Use smb 2 - 3 and cifsacl mount options getacl functions
cifs: prototype declaration and definition for smb 2 - 3 and cifsacl mount options
CIFS: add CONFIG_CIFS_DEBUG_KEYS to dump encryption keys
cifs: set mapping error when page writeback fails in writepage or launder_pages
SMB3: Enable encryption for SMB3.1.1
Pull GFS2 fix from Bob Peterson:
"Sorry for the additional merge request, but Andreas discovered this
problem soon after you processed our last gfs2 merge.
This fixes a regression introduced by a patch we did in mid-2015
(commit 88ffbf3e03: "GFS2: Use resizable hash table for glocks"), so
best to get it fixed. Some code was reverted that should not have
been.
The patch from Andreas Gruenbacher just re-adds code that had been
there originally"
* tag 'gfs2-4.13.fixes.addendum' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix glock rhashtable rcu bug
take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified). In either case the pointer to stable
string is stored into the same structure.
dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().
Intended use:
struct name_snapshot s;
take_dentry_name_snapshot(&s, dentry);
...
access s.name
...
release_dentry_name_snapshot(&s);
Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Michael Ellerman reported that commit 8c6657cb50 ("Switch flock
copyin/copyout primitives to copy_{from,to}_user()") broke his
networking on a bunch of PPC machines (64-bit kernel, 32-bit userspace).
The reason is a brown-paper bug by that commit, which had the arguments
to "copy_flock_fields()" in the wrong order, breaking the compat
handling for file locking. Apparently very few people run 32-bit user
space on x86 any more, so the PPC people got the honor of noticing this
"feature".
Michael also sent a minimal diff that just changed the order of the
arguments in that macro.
This is not that minimal diff.
This not only changes the order of the arguments in the macro, it also
changes them to be pointers (to be consistent with all the other uses of
those pointers), and makes the functions that do all of this also have
the proper "const" attribution on the source pointers in order to make
issues like that (using the source as a destination) be really obvious.
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before commit 88ffbf3e03 "GFS2: Use resizable hash table for glocks",
glocks were freed via call_rcu to allow reading the glock hashtable
locklessly using rcu. This was then changed to free glocks immediately,
which made reading the glock hashtable unsafe. Bring back the original
code for freeing glocks via call_rcu.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: stable@vger.kernel.org # 4.3+
In order to avoid lock contention for atomic written pages, we'd better give
EBUSY in f2fs_migrate_page when mode is asynchronous. We expect it will be
released soon as transaction commits.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we count all inode consumed blocks including inode block,
xattr block, index block, data block into i_blocks, for other generic
filesystems, they won't count inode block into i_blocks, so for
userspace applications or quota system, they may detect incorrect block
count according to i_blocks value in inode.
This patch changes to count all blocks into inode.i_blocks excluding
inode block, for on-disk i_blocks, we keep counting inode block for
backward compatibility.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't clear old mount option before parse new option during ->remount_fs
like other generic filesystems.
This reverts commit 26666c8a43.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After renaming a directory, fsck could detect unmatched pino. The scenario
can be reproduced as the following:
$ mkdir /bar/subbar /foo
$ rename /bar/subbar /foo
Then fsck will report:
[ASSERT] (__chk_dots_dentries:1182) --> Bad inode number[0x3] for '..', parent parent ino is [0x4]
Rename sets LOST_PINO for old_inode. However, the flag cannot be cleared,
since dir is written back with CP. So, let's get rid of LOST_PINO for a
renamed dir and fix the pino directly at the end of rename.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since directories will be written back with checkpoint and fsync a
directory will always write CP, there is no need to set LOST_PINO
after creating a directory.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Skip ->writepages in prior to ->writepage for {meta,node}_inode during
recovery, hence unneeded loop in ->writepages can be avoided.
Moreover, check SBI_POR_DOING earlier while writebacking pages.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After we introduce discard thread, discard command can be issued
concurrently with data allocating, this patch adds new function to
heck sit bitmap to ensure that userdata was invalid in which on-going
discard command covered.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch resolves kernel panic for xfstests/081, caused by recent f2fs_bug_on
f2fs: add f2fs_bug_on in __remove_discard_cmd
For fixing, we will stop gc/discard thread in prior in ->kill_sb in order to
avoid referring and releasing race among them.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this patch, we add a new sysfs interface, with it, we can control
number of reserved blocks in system which could not be used by user,
it enable f2fs to let user to configure for adjusting over-provision
ratio dynamically instead of changing it by mkfs.
So we can expect it will help to reserve more free space for relieving
GC in both filesystem and flash device.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
create_flush_cmd_control will create redundant issue_flush_thread after each
remount with flush_merge option.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the partition is small, we don't need to report total # of inodes including
hidden free nodes.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull libnvdimm updates from Dan Williams:
"libnvdimm updates for the latest ACPI and UEFI specifications. This
pull request also includes new 'struct dax_operations' enabling to
undo the abuse of copy_user_nocache() for copy operations to pmem.
The dax work originally missed 4.12 to address concerns raised by Al.
Summary:
- Introduce the _flushcache() family of memory copy helpers and use
them for persistent memory write operations on x86. The
_flushcache() semantic indicates that the cache is either bypassed
for the copy operation (movnt) or any lines dirtied by the copy
operation are written back (clwb, clflushopt, or clflush).
- Extend dax_operations with ->copy_from_iter() and ->flush()
operations. These operations and other infrastructure updates allow
all persistent memory specific dax functionality to be pushed into
libnvdimm and the pmem driver directly. It also allows dax-specific
sysfs attributes to be linked to a host device, for example:
/sys/block/pmem0/dax/write_cache
- Add support for the new NVDIMM platform/firmware mechanisms
introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2
namespace label format, extensions to the address-range-scrub
command set, new error injection commands, and a new BTT
(block-translation-table) layout. These updates support inter-OS
and pre-OS compatibility.
- Fix a longstanding memory corruption bug in nfit_test.
- Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
capable.
- Miscellaneous fixes and small updates across libnvdimm and the nfit
driver.
Acknowledgements that came after the branch was pushed: commit
6aa734a2f3 ("libnvdimm, region, pmem: fix 'badblocks'
sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani
<toshi.kani@hpe.com>"
* tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits)
libnvdimm, namespace: record 'lbasize' for pmem namespaces
acpi/nfit: Issue Start ARS to retrieve existing records
libnvdimm: New ACPI 6.2 DSM functions
acpi, nfit: Show bus_dsm_mask in sysfs
libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.
acpi, nfit: Enable DSM pass thru for root functions.
libnvdimm: passthru functions clear to send
libnvdimm, btt: convert some info messages to warn/err
libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime
libnvdimm: fix the clear-error check in nsio_rw_bytes
libnvdimm, btt: fix btt_rw_page not returning errors
acpi, nfit: quiet invalid block-aperture-region warnings
libnvdimm, btt: BTT updates for UEFI 2.7 format
acpi, nfit: constify *_attribute_group
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
libnvdimm, pmem, dax: export a cache control attribute
dax: convert to bitmask for flags
dax: remove default copy_from_iter fallback
libnvdimm, nfit: enable support for volatile ranges
libnvdimm, pmem: fix persistence warning
...
XFS has a maximum symlink target length of 1024 bytes; this is a
holdover from the Irix days. Unfortunately, the constant establishing
this is 'MAXPATHLEN' and is /not/ the same as the Linux MAXPATHLEN,
which is 4096.
The kernel enforces its 1024 byte MAXPATHLEN on symlink targets, but
xfsprogs picks up the (Linux) system 4096 byte MAXPATHLEN, which means
that xfs_repair doesn't complain about oversized symlinks.
Since this is an on-disk format constraint, put the define in the XFS
namespace and move everything over to use the new name.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Current code does not update ceph_dentry_info::lease_session once
it is set. If auth mds of corresponding dentry changes, dentry lease
keeps in an invalid state.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Current ceph uses FSID as primary index key of fscache data. This
allows ceph to retain cached data across remount. But this causes
problem (kernel opps, fscache does not support sharing data) when
a filesystem get mounted several times (with fscache enabled, with
different mount options).
The fix is adding a new mount option, which specifies uniquifier
for fscache.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
extra_mon_dispatch() and debugfs' foo_show functions dereference
fsc->mdsc. we should clean up fsc->client->extra_mon_dispatch
and debugfs before destroying fsc->mds.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Previously we were returning values for quota, layout
xattrs without any kind of update -- the user just got
whatever happened to be in our cache.
Clearly this extra round trip has a cost, but reads of
these xattrs are fairly rare, happening on admin
intervention rather than in normal operation.
Link: http://tracker.ceph.com/issues/17939
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Don't re-send interrupted flock request in cases of mds failover
and receiving request forward. Because corresponding 'lock intr'
request may have been finished, it won't get re-sent.
Link: http://tracker.ceph.com/issues/20170
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Ceph needs to flush dirty page in the order in which in which snap
context they belong to. Dirty pages belong to older snap context
should be flushed earlier. if writepage_nounlock() can not flush a
page, it should redirty the page.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The old 'approaching max_size' code expects MDS set max_size to
'2 * reported_size'. This is no longer true. The new code reports
file size when half of previous max_size increment has been used.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The 'wanted max size' could be sent to inode's old auth mds, re-send
it to inode's new auth mds if necessary. Otherwise write syscall may
hang.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Merge misc updates from Andrew Morton:
- a few hotfixes
- various misc updates
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (108 commits)
mm, memory_hotplug: move movable_node to the hotplug proper
mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
mm, memory_hotplug: drop artificial restriction on online/offline
mm: memcontrol: account slab stats per lruvec
mm: memcontrol: per-lruvec stats infrastructure
mm: memcontrol: use generic mod_memcg_page_state for kmem pages
mm: memcontrol: use the node-native slab memory counters
mm: vmstat: move slab statistics from zone to node counters
mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare()
mm/zswap.c: improve a size determination in zswap_frontswap_init()
mm/zswap.c: delete an error message for a failed memory allocation in zswap_pool_create()
mm/swapfile.c: sort swap entries before free
mm/oom_kill: count global and memory cgroup oom kills
mm: per-cgroup memory reclaim stats
mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects
mm: kmemleak: factor object reference updating out of scan_block()
mm: kmemleak: slightly reduce the size of some structures on 64-bit architectures
mm, mempolicy: don't check cpuset seqlock where it doesn't matter
mm, cpuset: always use seqlock when changing task's nodemask
mm, mempolicy: simplify rebinding mempolicies when updating cpusets
...
Pull misc compat stuff updates from Al Viro:
"This part is basically untangling various compat stuff. Compat
syscalls moved to their native counterparts, getting rid of quite a
bit of double-copying and/or set_fs() uses. A lot of field-by-field
copyin/copyout killed off.
- kernel/compat.c is much closer to containing just the
copyin/copyout of compat structs. Not all compat syscalls are gone
from it yet, but it's getting there.
- ipc/compat_mq.c killed off completely.
- block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
drivers/block/floppy.c where they belong. Yes, there are several
drivers that implement some of the same ioctls. Some are m68k and
one is 32bit-only pmac. drivers/block/floppy.c is the only one in
that bunch that can be built on biarch"
* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mqueue: move compat syscalls to native ones
usbdevfs: get rid of field-by-field copyin
compat_hdio_ioctl: get rid of set_fs()
take floppy compat ioctls to sodding floppy.c
ipmi: get rid of field-by-field __get_user()
ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
rt_sigtimedwait(): move compat to native
select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
put_compat_rusage(): switch to copy_to_user()
sigpending(): move compat to native
getrlimit()/setrlimit(): move compat to native
times(2): move compat to native
compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
do_sigaltstack(): lift copying to/from userland into callers
take compat_sys_old_getrlimit() to native syscall
trim __ARCH_WANT_SYS_OLD_GETRLIMIT
When doing an incremental send, while processing an extent that changed
between the parent and send snapshots and that extent was an inline extent
in the parent snapshot, it's possible to access a memory region beyond
the end of leaf if the inline extent is very small and it is the first
item in a leaf.
An example scenario is described below.
The send snapshot has the following leaf:
leaf 33865728 items 33 free space 773 generation 46 owner 5
fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
(...)
item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53
generation 36 type 1 (regular)
extent data disk byte 12791808 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53
generation 36 type 1 (regular)
extent data disk byte 138170368 nr 225280
extent data offset 0 nr 225280 ram 225280
extent compression 0 (none)
(...)
And the parent snapshot has the following leaf:
leaf 31272960 items 17 free space 17 generation 31 owner 5
fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44
generation 31 type 0 (inline)
inline extent data size 23 ram_bytes 613 compression 1 (zlib)
(...)
When computing the send stream, it is detected that the extent of inode
335, at file offset 0, and at fs/btrfs/send.c:is_extent_unchanged() we
grab the leaf from the parent snapshot and access the inline extent item.
However, before jumping to the 'out' label, we access the 'offset' and
'disk_bytenr' fields of the extent item, which should not be done for
inline extents since the inlined data starts at the offset of the
'disk_bytenr' field and can be very small. For example accessing the
'offset' field of the file extent item results in the following trace:
[ 599.705368] general protection fault: 0000 [#1] PREEMPT SMP
[ 599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$
[ 599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1
[ 599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000
[ 599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs]
[ 599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286
[ 599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001
[ 599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f
[ 599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098
[ 599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7
[ 599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088
[ 599.709340] FS: 00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000
[ 599.709340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0
[ 599.709340] Call Trace:
[ 599.709340] btrfs_get_token_64+0x93/0xce [btrfs]
[ 599.709340] ? printk+0x48/0x50
[ 599.709340] btrfs_get_64+0xb/0xd [btrfs]
[ 599.709340] process_extent+0x3a1/0x1106 [btrfs]
[ 599.709340] ? btree_read_extent_buffer_pages+0x5/0xef [btrfs]
[ 599.709340] changed_cb+0xb03/0xb3d [btrfs]
[ 599.709340] ? btrfs_get_token_32+0x7a/0xcc [btrfs]
[ 599.709340] btrfs_compare_trees+0x432/0x53d [btrfs]
[ 599.709340] ? process_extent+0x1106/0x1106 [btrfs]
[ 599.709340] btrfs_ioctl_send+0x960/0xe26 [btrfs]
[ 599.709340] btrfs_ioctl+0x181b/0x1fed [btrfs]
[ 599.709340] ? trace_hardirqs_on_caller+0x150/0x1ac
[ 599.709340] vfs_ioctl+0x21/0x38
[ 599.709340] ? vfs_ioctl+0x21/0x38
[ 599.709340] do_vfs_ioctl+0x611/0x645
[ 599.709340] ? rcu_read_unlock+0x5b/0x5d
[ 599.709340] ? __fget+0x6d/0x79
[ 599.709340] SyS_ioctl+0x57/0x7b
[ 599.709340] entry_SYSCALL_64_fastpath+0x18/0xad
[ 599.709340] RIP: 0033:0x7f51945eec47
[ 599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47
[ 599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004
[ 599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700
[ 599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046
[ 599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040
[ 599.709340] ? trace_hardirqs_off_caller+0x43/0xb1
[ 599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$
[ 599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00
[ 599.762057] ---[ end trace fe00d7af61b9f49e ]---
This is because the 'offset' field starts at an offset of 37 bytes
(offsetof(struct btrfs_file_extent_item, offset)), has a length of 8
bytes and therefore attemping to read it causes a 1 byte access beyond
the end of the leaf, as the first item's content in a leaf is located
at the tail of the leaf, the item size is 44 bytes and the offset of
that field plus its length (37 + 8 = 45) goes beyond the item's size
by 1 byte.
So fix this by accessing the 'offset' and 'disk_bytenr' fields after
jumping to the 'out' label if we are processing an inline extent. We
move the reading operation of the 'disk_bytenr' field too because we
have the same problem as for the 'offset' field explained above when
the inline data is less then 8 bytes. The access to the 'generation'
field is also moved but just for the sake of grouping access to all
the fields.
Fixes: e1cbfd7bf6 ("Btrfs: send, fix file hole not being preserved due to inline extent")
Cc: <stable@vger.kernel.org> # v4.12+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
| |--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- file1 (ino 261)
| |--- dir4/ (ino 262)
|
|--- dir5/ (ino 260)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- dir3/ (ino 259)
| |--- dir4 (ino 261)
|
|--- dir6/ (ino 263)
|--- dir44/ (ino 262)
|--- file11 (ino 261)
|--- dir55/ (ino 260)
When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:
receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
utimes
utimes dir1
utimes dir1/dir2/dir3
utimes
rename dir1/dir2/dir3/dir4 -> o262-7-0
link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) When processing inode 261, we orphanize inode 262 due to a name/location
collision with one of the new hard links for inode 261 (created in the
second step below).
2) We create one of the 2 new hard links for inode 261, the one whose
location is at "dir1/dir2/dir3/dir4".
3) We then attempt to create the other new hard link for inode 261, which
has inode 262 as its parent directory. Because the path for this new
hard link was computed before we started processing the new references
(hard links), it reflects the old name/location of inode 262, that is,
it does not account for the orphanization step that happened when
we started processing the new references for inode 261, whence it is
no longer valid, causing the receiver to fail.
So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Trivial fix to spelling mistake in mb_debug debug message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>