When an NFSv4 client performs a mount operation, it first mounts the
NFSv4 root and then does path walk to the exported path and performs a
submount on that, cloning the security mount options from the root's
superblock to the submount's superblock in the process.
Unless the NFS server has an explicit fsid=0 export with the
"security_label" option, the NFSv4 root superblock will not have
SBLABEL_MNT set, and neither will the submount superblock after cloning
the security mount options. As a result, setxattr's of security labels
over NFSv4.2 will fail. In a similar fashion, NFSv4.2 mounts mounted
with the context= mount option will not show the correct labels because
the nfs_server->caps flags of the cloned superblock will still have
NFS_CAP_SECURITY_LABEL set.
Allowing the NFSv4 client to enable or disable SECURITY_LSM_NATIVE_LABELS
behavior will ensure that the SBLABEL_MNT flag has the correct value
when the client traverses from an exported path without the
"security_label" option to one with the "security_label" option and
vice versa. Similarly, checking to see if SECURITY_LSM_NATIVE_LABELS is
set upon return from security_sb_clone_mnt_opts() and clearing
NFS_CAP_SECURITY_LABEL if necessary will allow the correct labels to
be displayed for NFSv4.2 mounts mounted with the context= mount option.
Resolves: https://github.com/SELinuxProject/selinux-kernel/issues/35
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Stephen Smalley <sds@tycho.nsa.gov>
Tested-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <paul@paul-moore.com>
btrfs_calc_trans_metadata_size() does an unsigned 32-bit multiplication,
which can overflow if num_items >= 4 GB / (nodesize * BTRFS_MAX_LEVEL * 2).
For a nodesize of 16kB, this overflow happens at 16k items. Usually,
num_items is a small constant passed to btrfs_start_transaction(), but
we also use btrfs_calc_trans_metadata_size() for metadata reservations
for extent items in btrfs_delalloc_{reserve,release}_metadata().
In drop_outstanding_extents(), num_items is calculated as
inode->reserved_extents - inode->outstanding_extents. The difference
between these two counters is usually small, but if many delalloc
extents are reserved and then the outstanding extents are merged in
btrfs_merge_extent_hook(), the difference can become large enough to
overflow in btrfs_calc_trans_metadata_size().
The overflow manifests itself as a leak of a multiple of 4 GB in
delalloc_block_rsv and the metadata bytes_may_use counter. This in turn
can cause early ENOSPC errors. Additionally, these WARN_ONs in
extent-tree.c will be hit when unmounting:
WARN_ON(fs_info->delalloc_block_rsv.size > 0);
WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
WARN_ON(space_info->bytes_pinned > 0 ||
space_info->bytes_reserved > 0 ||
space_info->bytes_may_use > 0);
Fix it by casting nodesize to a u64 so that
btrfs_calc_trans_metadata_size() does a full 64-bit multiplication.
While we're here, do the same in btrfs_calc_trunc_metadata_size(); this
can't overflow with any existing uses, but it's better to be safe here
than have another hard-to-debug problem later on.
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Before this, we use 'filled' mode here, ie. if all range has been
filled with EXTENT_DEFRAG bits, get to clear it, but if the defrag
range joins the adjacent delalloc range, then we'll have EXTENT_DEFRAG
bits in extent_state until releasing this inode's pages, and that
prevents extent_data from being freed.
This clears the bit if any was found within the ordered extent.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
In verify_dir_item, it wants to printk name_len of dir_item but
printk data_len acutally.
Fix it by calling btrfs_dir_name_len instead of btrfs_dir_data_len.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Once we move the block layer to its own status code we'll still want to
propagate the bio_iov_iter_get_pages, so restructure __blkdev_direct_IO
to take ret into account when returning the errno.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Only read bio->bi_error once in the common path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In order to avoid future diversions between fs/compat_ioctl.c and
drivers/tty/pty.c, define .compat_ioctl callbacks for the relevant
tty_operations structs. Since both pty_unix98_ioctl() and
pty_bsd_ioctl() are compatible between 32-bit and 64-bit userspace no
special translation is required.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Before the patch, the flock flag could remain uninitialized for the
lifespan of the fuse_file allocation. Unless set to true in
fuse_file_flock(), it would remain in an indeterminate state until read in
an if statement in fuse_release_common(). This could consequently lead to
taking an unexpected branch in the code.
The bug was discovered by a runtime instrumentation designed to detect use
of uninitialized memory in the kernel.
Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Fixes: 37fb3a30b4 ("fuse: fix flock")
Cc: <stable@vger.kernel.org> # v3.1+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The 0-day kernel test robot reports assertion failures on
!CONFIG_SMP kernels due to failed spin_is_locked() checks. As it
turns out, spin_is_locked() is hardcoded to return zero on
!CONFIG_SMP kernels and so this function cannot be relied on to
verify spinlock state in this configuration.
To avoid this problem, replace the associated asserts with lockdep
variants that do the right thing regardless of kernel configuration.
Drop the one assert that checks for an unlocked lock as there is no
suitable lockdep variant for that case. This moves the spinlock
checks from XFS debug code to lockdep, but generally provides the
same level of protection.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
On sparc, if we have an alloca() like situation, as is the case with
SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
memory. The result can be that the value is clobbered if a trap
or interrupt arrives at just the right instruction.
It only occurs if the function ends returning a value from that
alloca() area and that value can be placed into the return value
register using a single instruction.
For example, in lib/libcrc32c.c:crc32c() we end up with a return
sequence like:
return %i7+8
lduw [%o5+16], %o0 ! MEM[(u32 *)__shash_desc.1_10 + 16B],
%o5 holds the base of the on-stack area allocated for the shash
descriptor. But the return released the stack frame and the
register window.
So if an intererupt arrives between 'return' and 'lduw', then
the value read at %o5+16 can be corrupted.
Add a data compiler barrier to work around this problem. This is
exactly what the gcc fix will end up doing as well, and it absolutely
should not change the code generated for other cpus (unless gcc
on them has the same bug :-)
With crucial insight from Eric Sandeen.
Cc: <stable@vger.kernel.org>
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Provide a control message that can be specified on the first sendmsg() of a
client call or the first sendmsg() of a service response to indicate the
total length of the data to be transmitted for that call.
Currently, because the length of the payload of an encrypted DATA packet is
encrypted in front of the data, the packet cannot be encrypted until we
know how much data it will hold.
By specifying the length at the beginning of the transmit phase, each DATA
packet length can be set before we start loading data from userspace (where
several sendmsg() calls may contribute to a particular packet).
An error will be returned if too little or too much data is presented in
the Tx phase.
Signed-off-by: David Howells <dhowells@redhat.com>
Now that we have umask support, we shouldn't re-send the mode in a SETATTR
following an exclusive CREATE, or we risk having the same problem fixed in
commit 5334c5bdac ("NFS: Send attributes in OPEN request for
NFS4_CREATE_EXCLUSIVE4_1"), which is that files with S_ISGID will have that
bit stripped away.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: dff25ddb48 ("nfs: add support for the umask attribute")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers. More to come..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Use the common helper uuid_is_null() and remove the xfs specific
helper uuid_is_nil().
The common helper does not check for the NULL pointer value as
xfs helper did, but xfs code never calls the helper with a pointer
that can be NULL.
Conform comments and warning strings to use the term 'null uuid'
instead of 'nil uuid', because this is the terminology used by
lib/uuid.c and its users. It is also the terminology used in
userspace by libuuid and xfsprogs.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[hch: remove now unused uuid.[ch]]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Opencode uuid_getnodeuniq in the only caller, and directly decode
the uuid_t representation instead of using a structure cast for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
These helper are used to compare and copy two uuid_t type objects.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[hch: also provide the respective guid_ versions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Our "little endian" UUID really is a Wintel GUID, so rename it and its
helpers such (guid_t). The big endian UUID is the only true one, so
give it the name uuid_t. The uuid_le and uuid_be names are retained for
now, but will hopefully go away soon. The exception to that are the _cmp
helpers that will be replaced by better primitives ASAP and thus don't
get the new names.
Also the _to_bin helpers are named to match the better named uuid_parse
routine in userspace.
Also remove the existing typedef in XFS that's now been superceeded by
the generic type name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[andy: also update the UUID_LE/UUID_BE macros including fallout]
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This essentially is a partial revert of commit ff548773
("afs: Move UUID struct to linux/uuid.h") and moves struct uuid_v1 back into
fs/afs as struct afs_uuid. It however keeps it as big endian structure
so that we can use the normal uuid generation helpers when casting to/from
struct afs_uuid.
The V1 uuid intrepretation in struct form isn't really useful to the
rest of the kernel, and not really compatible to it either, so move it
back to AFS instead of polluting the global uuid.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Howells <dhowells@redhat.com>
This fixes a problem with reading files larger than 2GB from a UFS-2
file system:
https://bugzilla.kernel.org/show_bug.cgi?id=195721
The incorrect UFS s_maxsize limit became a problem as of commit
c2a9737f45 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
which started using s_maxbytes to avoid a page index overflow in
do_generic_file_read().
That caused files to be truncated on UFS-2 file systems because the
default maximum file size is 2GB (MAX_NON_LFS) and UFS didn't update it.
Here I simply increase the default to a common value used by other file
systems.
Signed-off-by: Richard Narron <comet.berkeley@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Will B <will.brokenbourgh2877@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org> # v4.9 and backports of c2a9737f45
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull NFS client bugfixes from Trond Myklebust:
"Bugfixes include:
- Fix a typo in commit e092693443 ("NFS append COMMIT after
synchronous COPY") that breaks copy offload
- Fix the connect error propagation in xs_tcp_setup_socket()
- Fix a lock leak in nfs40_walk_client_list
- Verify that pNFS requests lie within the offset range of the layout
segment"
* tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: Mark unnecessarily extern functions as static
SUNRPC: ensure correct error is reported by xs_tcp_setup_socket()
NFSv4.0: Fix a lock leak in nfs40_walk_client_list
pnfs: Fix the check for requests in range of layout segment
xprtrdma: Delete an error message for a failed memory allocation in xprt_rdma_bc_setup()
pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
NFS fix COMMIT after COPY
nfs_initialise_sb() and nfs_clone_super() are declared as extern even
though they are used only in fs/nfs/super.c. Mark them as static.
Also remove explicit 'inline' directive from nfs_initialise_sb() and
leave it upto compiler to decide whether inlining is worth it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
scripts/gdb: make lx-dmesg command work (reliably)
mm: consider memblock reservations for deferred memory initialization sizing
mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified
mlock: fix mlock count can not decrease in race condition
mm/migrate: fix refcount handling when !hugepage_migration_supported()
dax: fix race between colliding PMD & PTE entries
mm: avoid spurious 'bad pmd' warning messages
mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once
pcmcia: remove left-over %Z format
slub/memcg: cure the brainless abuse of sysfs attributes
initramfs: fix disabling of initramfs (and its compression)
mm: clarify why we want kmalloc before falling backto vmallock
frv: declare jiffies to be located in the .data section
include/linux/gfp.h: fix ___GFP_NOLOCKDEP value
ksm: prevent crash after write_protect_page fails
We currently have two related PMD vs PTE races in the DAX code. These
can both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that
private mapping reads can be handled with PMDs but private mapping
writes are always handled with PTEs so that we can COW.
Here is the first race:
CPU 0 CPU 1
(private mapping write)
__handle_mm_fault()
create_huge_pmd() - FALLBACK
handle_pte_fault()
passes check for pmd_devmap()
(private mapping read)
__handle_mm_fault()
create_huge_pmd()
dax_iomap_pmd_fault() inserts PMD
dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
installed in our page tables at this spot.
Here's the second race:
CPU 0 CPU 1
(private mapping read)
__handle_mm_fault()
passes check for pmd_none()
create_huge_pmd()
dax_iomap_pmd_fault() inserts PMD
(private mapping write)
__handle_mm_fault()
create_huge_pmd() - FALLBACK
(private mapping read)
__handle_mm_fault()
passes check for pmd_none()
create_huge_pmd()
handle_pte_fault()
dax_iomap_pte_fault() inserts PTE
dax_iomap_pmd_fault() inserts PMD,
but we already have a PTE at
this spot.
The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking,
there is no isolation between faults in the code in mm/memory.c. This
means for instance that this code in __handle_mm_fault() can run:
if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
ret = create_huge_pmd(&vmf);
But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent. This is the cause of
the 2nd race. The first race is similar - there is the following check
in handle_pte_fault():
} else {
/* See comment in pte_alloc_one_map() */
if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
return 0;
So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault. This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.
In my testing these races result in the following types of errors:
BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
BUG: non-zero nr_ptes on freeing mm: 15
Fix this issue by having the DAX fault handlers verify that it is safe
to continue their fault after they have taken an entry lock to block
other racing faults.
[ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pawel Lebioda <pawel.lebioda@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Xiong Zhou <xzhou@redhat.com>
Cc: Eryu Guan <eguan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull XFS fix from Darrick Wong:
"I've one more bugfix for you for 4.12-rc4: Fix an unmount hang due to
a race in io buffer accounting"
* tag 'xfs-4.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: use ->b_state to fix buffer I/O accounting release race
Pull nfsd fixes from Bruce Fields:
"Revert patch accidentally included in the merge window pull request,
and fix a crash that was likely a result of buggy client behavior"
* tag 'nfsd-4.12-1' of git://linux-nfs.org/~bfields/linux:
nfsd4: fix null dereference on replay
nfsd: Revert "nfsd: check for oversized NFSv2/v3 arguments"
Pull gcc-plugin prepwork from Kees Cook:
"Use designated initializers for mtk-vcodec, powerplay, amdgpu, and
sgi-xp. Use ERR_CAST() to avoid cross-structure cast in ocf2, ntfs,
and NFS.
Christoph Hellwig recommended that I send these fixes now, rather than
waiting for the v4.13 merge window. These are all initializer and cast
fixes needed for the future randstruct plugin that haven't been picked
up by the respective maintainers"
* tag 'gcc-plugins-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
mtk-vcodec: Use designated initializers
drm/amd/powerplay: Use designated initializers
drm/amdgpu: Use designated initializers
sgi-xp: Use designated initializers
ocfs2: Use ERR_CAST() to avoid cross-structure cast
ntfs: Use ERR_CAST() to avoid cross-structure cast
NFS: Use ERR_CAST() to avoid cross-structure cast
Since using scsi_req() is only allowed against request queues for
which struct scsi_request is the first member of their private
request data, refuse to submit SCSI commands against a queue for
which this is not the case.
References: commit 82ed4db499 ("block: split scsi_request out of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Omar Sandoval <osandov@fb.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull Reiserfs and GFS2 fixes from Jan Kara:
"Fixes to GFS2 & Reiserfs for the fallout of the recent WRITE_FUA
cleanup from Christoph.
Fixes for other filesystems were already merged by respective
maintainers."
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: Make flush bios explicitely sync
gfs2: Make flush bios explicitely sync
Instead write a proper compat syscall that calls common helpers.
[ jlayton: fix pointer dereferencing in fixup_compat_flock ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
If we have to recover relocation during mount, we'll ultimately have to
evict the orphan inode. That goes through the reservation dance, where
priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
to be valid. That's the next thing to be set up during mount, so we
crash, almost always in flush_space trying to join the transaction
but priority_reclaim_metadata_space is possible as well. This call
path has been problematic in the past WRT whether ->fs_root is valid
yet. Commit 957780eb27 (Btrfs: introduce ticketed enospc
infrastructure) added new users that are called in the direct path
instead of the async path that had already been worked around.
The thing is that we don't actually need the fs_root, specifically, for
anything. We either use it to determine whether the root is the
chunk_root for use in choosing an allocation profile or as a root to pass
btrfs_join_transaction before immediately committing it. Anything that
isn't the chunk root works in the former case and any root works in
the latter.
A simple fix is to use a root we know will always be there: the
extent_root.
Cc: <stable@vger.kernel.org> # v4.8+
Fixes: 957780eb27 (Btrfs: introduce ticketed enospc infrastructure)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Variables start_idx and end_idx are supposed to hold a page index
derived from the file offsets. The int type is not the right one though,
offsets larger than 1 << 44 will get silently trimmed off the high bits.
(1 << 44 is 16TiB)
What can go wrong, if start is below the boundary and end gets trimmed:
- if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
- the final check "if (page->index <= end_idx)" will unexpectedly fail
The function will return false, ie. "there's no page in the range",
although there is at least one.
btrfs_page_exists_in_range is used to prevent races in:
* in hole punching, where we make sure there are not pages in the
truncated range, otherwise we'll wait for them to finish and redo
truncation, but we're going to replace the pages with holes anyway so
the only problem is the intermediate state
* lock_extent_direct: we want to make sure there are no pages before we
lock and start DIO, to prevent stale data reads
For practical occurence of the bug, there are several constaints. The
file must be quite large, the affected range must cross the 16TiB
boundary and the internal state of the file pages and pending operations
must match. Also, we must not have started any ordered data in the
range, otherwise we don't even reach the buggy function check.
DIO locking tries hard in several places to avoid deadlocks with
buffered IO and avoids waiting for ranges. The worst consequence seems
to be stale data read.
CC: Liu Bo <bo.li.liu@oracle.com>
CC: stable@vger.kernel.org # 3.16+
Fixes: fc4adbff82 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The current time will be initially available in the record->time field
for all pstore_read() and pstore_write() calls. Backends can either
update the field during read(), or use the field during write() instead
of fetching time themselves.
Signed-off-by: Kees Cook <keescook@chromium.org>
In preparation for setting timestamps in the pstore core, create a common
initializer routine, instead of using static initializers.
Signed-off-by: Kees Cook <keescook@chromium.org>
If a backend does not correctly iterate through its records, pstore will
get stuck loading entries. Detect this with a large record count, and
announce if we ever hit the limit. This will let future backend reading
bugs less annoying to debug. Additionally adjust the error about
pstore_mkfile() failing.
Signed-off-by: Kees Cook <keescook@chromium.org>
When the "if (record->size <= 0)" test is true in
pstore_get_backend_records() it's pretty clear that nobody holds a
reference to the allocated pstore_record, yet we don't free it.
Let's free it.
Fixes: 2a2b0acf76 ("pstore: Allocate records on heap instead of stack")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
commit 9abdcccc3d ("pstore: Extract common arguments into structure")
moved record decompression to function. decompress_record() gets
called without checking type and compressed flag. Warning will be
reported if data is uncompressed. Pstore type PSTORE_TYPE_PPC_OPAL,
PSTORE_TYPE_PPC_COMMON doesn't contain compressed data and warning get
printed part of dmesg.
Partial dmesg log:
[ 35.848914] pstore: ignored compressed record type 6
[ 35.848927] pstore: ignored compressed record type 8
Above warning should not get printed as it is known that data won't be
compressed for above type and it is valid condition.
This patch returns if data is not compressed and print warning only if
data is compressed and type is not PSTORE_TYPE_DMESG.
Reported-by: Anton Blanchard <anton@au1.ibm.com>
Signed-off-by: Ankit Kumar <ankit@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Fixes: 9abdcccc3d ("pstore: Extract common arguments into structure")
Cc: stable@vger.kernel.org
Pull overlayfs fixes from Miklos Szeredi:
"Fix regressions:
- missing CONFIG_EXPORTFS dependency
- failure if upper fs doesn't support xattr
- bad error cleanup
This also adds the concept of "impure" directories complementing the
"origin" marking introduced in -rc1. Together they enable getting
consistent st_ino and d_ino for directory listings.
And there's a bug fix and a cleanup as well"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: filter trusted xattr for non-admin
ovl: mark upper merge dir with type origin entries "impure"
ovl: mark upper dir with type origin entries "impure"
ovl: remove unused arg from ovl_lookup_temp()
ovl: handle rename when upper doesn't support xattr
ovl: don't fail copy-up if upper doesn't support xattr
ovl: check on mount time if upper fs supports setting xattr
ovl: fix creds leak in copy up error path
ovl: select EXPORTFS