Commit Graph

47524 Commits

Author SHA1 Message Date
Jan Kara
2d90c160e5 ext4: more efficient SEEK_DATA implementation
Using SEEK_DATA in a huge sparse file can easily lead to sotflockups as
ext4_seek_data() iterates hole block-by-block. Fix the problem by using
returned hole size from ext4_map_blocks() and thus skip the hole in one
go.

Update also SEEK_HOLE implementation to follow the same pattern as
SEEK_DATA to make future maintenance easier.

Furthermore we add cond_resched() to both ext4_seek_data() and
ext4_seek_hole() to avoid softlockups in case evil user creates huge
fragmented file and we have to go through lots of extents.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 23:11:13 -05:00
Jan Kara
e3fb8eb14e ext4: cleanup handling of bh->b_state in DAX mmap
ext4_dax_mmap_get_block() updates bh->b_state directly instead of using
ext4_update_bh_state(). This is mostly a cosmetic issue since DAX code
always passes on-stack buffer_head but clean this up to make code more
uniform.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 23:03:27 -05:00
Jan Kara
facab4d971 ext4: return hole from ext4_map_blocks()
Currently, ext4_map_blocks() just returns 0 when it finds a hole and
allocation is not requested. However we have all the information
available to tell how large the hole actually is and there are callers
of ext4_map_blocks() which would save some block-by-block hole iteration
if they knew this information. So fill in struct ext4_map_blocks even
for holes with the information we have. We keep returning 0 for holes to
maintain backward compatibility of the function.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 22:54:00 -05:00
Jan Kara
140a52508a ext4: factor out determining of hole size
ext4_ext_put_gap_in_cache() determines hole size in the extent tree,
then trims this with possible delayed allocated blocks, and inserts the
result into the extent status tree. Factor out determination of the size
of the hole in the extent tree as we will need this information in
ext4_ext_map_blocks() as well.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 22:46:57 -05:00
Linus Torvalds
718e47a573 Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fix from Ted Ts'o:
 "This fixes a regression which crept in v4.5-rc5"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: iterate over buffer heads correctly in move_extent_per_page()
2016-03-09 19:33:05 -08:00
Jan Kara
87d8a74b56 ext4: fix setting of referenced bit in ext4_es_lookup_extent()
We were setting referenced bit on the extent structure we return from
ext4_es_lookup_extent() which is just a private structure on stack. Thus
setting had no effect. Set the bit in the structure in the status tree
instead.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 22:26:55 -05:00
Eryu Guan
6ffe77bad5 ext4: iterate over buffer heads correctly in move_extent_per_page()
In commit bcff24887d ("ext4: don't read blocks from disk after extents
being swapped") bh is not updated correctly in the for loop and wrong
data has been written to disk. generic/324 catches this on sub-page
block size ext4.

Fixes: bcff24887d ("ext4: don't read blocks from disk after extentsbeing swapped")
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 21:37:53 -05:00
Ross Zwisler
30f471fd88 dax: check return value of dax_radix_entry()
dax_pfn_mkwrite() previously wasn't checking the return value of the
call to dax_radix_entry(), which was a mistake.

Instead, capture this return value and return the appropriate VM_FAULT_
value.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Jan Kara
566e8dfd88 ocfs2: fix return value from ocfs2_page_mkwrite()
ocfs2_page_mkwrite() could mistakenly return error code instead of
mkwrite status value.  Fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Martin Brandenburg
acfcbaf192 orangefs: make fs_mount_pending static
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-09 13:26:39 -05:00
Martin Brandenburg
c62da5853d orangefs: Avoid symlink upcall if target is too long.
Previously the client-core detected this condition by sheer luck!

Since we used strncpy, no NUL byte would be included on the name. The
client-core would call strlen, which would read past the end of its
buffer, but return a number large enough that the client-core would
return ENAMETOOLONG.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-09 13:26:39 -05:00
Mike Marshall
162ada7764 Orangefs: improve the POSIXness of interrupted writes...
Don't return EINTR on interrupted writes if some data has already
been written.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-09 13:12:37 -05:00
Mike Marshall
cf07c0bf88 Orangefs: add a new gossip statement
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-09 13:11:45 -05:00
Jan Kara
600be30a8b ext4: remove i_ioend_count
Remove counter of pending io ends as it is unused.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:39:21 -05:00
Jan Kara
109811c20f ext4: simplify io_end handling for AIO DIO
When mapping blocks for direct IO, we allocate io_end structure before
mapping blocks and store pointer to it in the inode. This creates a
requirement that any AIO DIO using io_end must be protected by i_mutex.
This created problems in the past with dioread_nolock mode which was
corrupting io_end pointers. Also io_end is allocated unnecessarily in
case where we don't need to convert any extents (which is a common case
for example when overwriting file).

We fix the problem by allocating io_end only once we return unwritten
extent from block mapping function for AIO DIO (so we can save some
pointless io_end allocations) and we pass pointer to it in bh->b_private
which generic DIO code later passes to our end IO callback. That way we
remove any need for global pointer to io_end structure and thus fix the
races.

The downside of this change is that the checking for unwritten IO in
flight in ext4_extents_can_be_merged() is more racy since we now
increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after dropping
i_data_sem. However the check has been racy already before because
ext4_writepages() already increment i_unwritten after dropping
i_data_sem and reserved blocks save us from hitting ENOSPC in the worst
case.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-03-08 23:36:46 -05:00
Jan Kara
efe70c2951 ext4: move trans handling and completion deferal out of _ext4_get_block
There is no need to handle starting of a transaction and deferal of DIO
completion in _ext4_get_block() function. We can move this out to get
block functions for direct IO that need it. That way we can add stricter
checks verifying things work as we expect.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:35:46 -05:00
Jan Kara
705965bd6d ext4: rename and split get blocks functions
Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better
describe what it does. Also split out get blocks functions for direct
IO. Later we move functionality from _ext4_get_blocks() there. There's no
functional change in this patch.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:08:10 -05:00
Jan Kara
e142d05263 ext4: use i_mutex to serialize unaligned AIO DIO
Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
However the code cleanups that happened after 2011 when the lock was
introduced made aio_mutex acquired at almost the same places where we
already have exclusion using i_mutex. So just use i_mutex for the
exclusion of unaligned AIO DIO.

The change moves waiting for pending unwritten extent conversion under
i_mutex. That makes special handling of O_APPEND writes unnecessary and
also avoids possible livelocking of unaligned AIO DIO with aligned one
(nothing was preventing contiguous stream of aligned AIO DIOs to let
unaligned AIO DIO wait forever).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 22:44:50 -05:00
Jan Kara
3bd6ad7b68 ext4: pack ioend structure better
On 64-bit architectures we have two 4-byte holes in struct ext4_io_end.
Order entries better to avoid this and thus make the structure occupy
64 instead of 72 bytes for 64-bit architectures.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 22:26:39 -05:00
Dave Chinner
ab9d1e4f7b Merge branch 'xfs-misc-fixes-4.6-3' into for-next 2016-03-09 08:18:30 +11:00
Luis de Bethencourt
a5fd276bdc xfs: remove impossible condition
bp_release is set to 0 just before the breakpoint of the for loop before
the conditional check (in line 458). The other breakpoint is a goto that
skips the dead code.

Addresses-Coverity-Id: 102338

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-09 08:17:56 +11:00
Darrick J. Wong
30cbc591c3 xfs: check sizes of XFS on-disk structures at compile time
Check the sizes of XFS on-disk structures when compiling the kernel.
Use this to catch inadvertent changes in structure size due to padding
and alignment issues, etc.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-09 08:15:14 +11:00
Al Viro
f93812846f jffs2: reduce the breakage on recovery from halfway failed rename()
d_instantiate(new_dentry, old_inode) is absolutely wrong thing to
do - it will oops if new_dentry used to be positive, for starters.
What we need is d_invalidate() the target and be done with that.

Cc: stable@vger.kernel.org # v3.18+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-07 23:07:10 -05:00
Al Viro
803c00123a ncpfs: fix a braino in OOM handling in ncp_fill_cache()
Failing to allocate an inode for child means that cache for *parent* is
incompletely populated.  So it's parent directory inode ('dir') that
needs NCPI_DIR_CACHE flag removed, *not* the child inode ('inode', which
is what we'd failed to allocate in the first place).

Fucked-up-in: commit 5e993e25 ("ncpfs: get rid of d_validate() nonsense")
Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org # v3.19
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-07 22:25:16 -05:00
Boris BREZILLON
f5b8aa78ef mtd: kill the ecclayout->oobavail field
ecclayout->oobavail is just redundant with the mtd->oobavail field.
Moreover, it prevents static const definition of ecc layouts since the
NAND framework is calculating this value based on the ecclayout->oobfree
field.

Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
2016-03-07 16:23:09 -08:00
Linus Torvalds
01ffa3df22 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "Overlayfs bug fixes.  All marked as -stable material"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: copy new uid/gid into overlayfs runtime inode
  ovl: ignore lower entries when checking purity of non-directory entries
  ovl: fix getcwd() failure after unsuccessful rmdir
  ovl: fix working on distributed fs as lower layer
2016-03-07 15:23:25 -08:00
Ingo Molnar
ec87e1cf7d Merge tag 'v4.5-rc7' into x86/asm, to pick up SMAP fix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-07 09:27:30 +01:00
Dave Chinner
3c1a79f5ff Merge branch 'xfs-misc-fixes-4.6-2' into for-next 2016-03-07 09:34:54 +11:00
Dave Chinner
85a9f38d38 Merge branch 'xfs-dax-fixes-4.6' into for-next 2016-03-07 09:34:31 +11:00
Dave Chinner
3d93ec0364 Merge branch 'xfs-writepage-rework-4.6' into for-next 2016-03-07 09:34:02 +11:00
Darrick J. Wong
0df61da8ac xfs: ioends require logically contiguous file offsets
We need to create a new ioend if the current writepage call isn't
logically contiguous with the range contained in the previous ioend.
Hopefully writepage gets called in order of increasing file offset.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 09:32:14 +11:00
Dave Chinner
7f0ed5461a Merge branch 'xfs-buf-macro-cleanup-4.6' into for-next 2016-03-07 09:31:00 +11:00
Dave Chinner
a2bbcb60ff Merge branch 'xfs-gut-icdinode-4.6' into for-next 2016-03-07 09:30:32 +11:00
Dave Chinner
6d247d47fb Merge branch 'xfs-misc-fixes-4.6' into for-next 2016-03-07 09:30:12 +11:00
Dave Chinner
acb3e26fc3 Merge branch 'xfs-dio-fix-4.6' into for-next 2016-03-07 09:29:48 +11:00
Dave Chinner
1b186d25b0 Merge branch 'xfs-get-next-dquot-4.6' into for-next 2016-03-07 09:29:25 +11:00
Dave Chinner
c53473be45 Merge branch 'xfs-rt-fixes-4.6' into for-next 2016-03-07 09:29:04 +11:00
Dave Chinner
9deed09554 Merge branch 'xfs-torn-log-fixes-4.5' into for-next 2016-03-07 09:28:36 +11:00
Darrick J. Wong
5110cd82ca xfs: use named array initializers for log item dumping
Use named array initializers for the string arrays used to dump log
items, rather than depending on the order being maintained correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:40:03 +11:00
Darrick J. Wong
49ca9118e6 xfs: fix computation of inode btree maxlevels
Commit 88740da18[1] introduced a function to compute the maximum
height of the inode btree back in 1994.  Back then, apparently, the
freespace and inode btrees shared the same geometry; however, it has
long since been the case that the inode and freespace btrees have
different record and key sizes.  Therefore, we must use m_inobt_mnr if
we want a correct calculation/log reservation/etc.

(Yes, this bug has been around for 21 years and ten months.)

(Yes, I was in middle school when this bug was committed.)

[1] http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=88740da18ddd9d7ba3ebaa9502fefc6ef2fd19cd

Historical-research-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:39:56 +11:00
Dave Chinner
a798011c8f xfs: reinitialise per-AG structures if geometry changes during recovery
If a crash occurs immediately after a filesystem grow operation, the
updated superblock geometry is found only in the log. After we
recover the log, the superblock is reread and re-initialised and so
has the new geometry in memory. If the new geometry has more AGs
than prior to the grow operation, then the new AGs will not have
in-memory xfs_perag structurea associated with them.

This will result in an oops when the first metadata buffer from a
new AG is looked up in the buffer cache, as the block lies within
the new geometry but then fails to find a perag structure on lookup.
This is easily fixed by simply re-initialising the perag structure
after re-reading the superblock at the conclusion of the first pahse
of log recovery.

This, however, does not fix the case of log recovery requiring
access to metadata in the newly grown space. Fortunately for us,
because the in-core superblock has not been updated, this will
result in detection of access beyond the end of the filesystem
and so recovery will fail at that point. If this proves to be
a problem, then we can address it separately to the current
reported issue.

Reported-by: Alex Lyakas <alex@zadarastorage.com>
Tested-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-03-07 08:39:36 +11:00
Brian Foster
7f6aff3a29 xfs: only run torn log write detection on dirty logs
XFS uses CRC verification over a sub-range of the head of the log to
detect and handle torn writes. This torn log write detection currently
runs unconditionally at mount time, regardless of whether the log is
dirty or clean. This is problematic in cases where a filesystem might
end up being moved across different, incompatible (i.e., opposite
byte-endianness) architectures.

The problem lies in the fact that log data is not necessarily written in
an architecture independent format. For example, certain bits of data
are written in native endian format. Further, the size of certain log
data structures differs (i.e., struct xlog_rec_header) depending on the
word size of the cpu. This leads to false positive crc verification
errors and ultimately failed mounts when a cleanly unmounted filesystem
is mounted on a system with an incompatible architecture from data that
was written near the head of the log.

Update the log head/tail discovery code to run torn write detection only
when the log is not clean. This means something other than an unmount
record resides at the head of the log and log recovery is imminent. It
is a requirement to run log recovery on the same type of host that had
written the content of the dirty log and therefore CRC failures are
legitimate corruptions in that scenario.

Reported-by: Jan Beulich <JBeulich@suse.com>
Tested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Brian Foster
717bc0ebca xfs: refactor in-core log state update to helper
Once the record at the head of the log is identified and verified, the
in-core log state is updated based on the record. This includes
information such as the current head block and cycle, the start block of
the last record written to the log, the tail lsn, etc.

Once torn write detection is conditional, this logic will need to be
reused. Factor the code to update the in-core log data structures into a
new helper function. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Brian Foster
65b99a08b3 xfs: refactor unmount record detection into helper
Once the mount sequence has identified the head and tail blocks of the
physical log, the record at the head of the log is located and examined
for an unmount record to determine if the log is clean. This currently
occurs after torn write verification of the head region of the log.

This must ultimately be separated from torn write verification and may
need to be called again if the log head is walked back due to a torn
write (to determine whether the new head record is an unmount record).
Separate this logic into a new helper function. This patch does not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Brian Foster
82ff6cc26e xfs: separate log head record discovery from verification
The code that locates the log record at the head of the log is buried in
the log head verification function. This is fine when torn write
verification occurs unconditionally, but this behavior is problematic
for filesystems that might be moved across systems with different
architectures.

In preparation for separating examination of the log head for unmount
records from torn write detection, lift the record location logic out of
the log verification function and into the caller. This patch does not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 08:22:22 +11:00
Linus Torvalds
21b27a74ec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fix from Sage Weil:
 "This is a final commit we missed to align the protocol compatibility
  with the feature bits.

  It decodes a few extra fields in two different messages and reports
  EIO when they are used (not yet supported)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
2016-03-06 11:31:13 -08:00
Christoph Hellwig
1ae1602de0 configfs: switch ->default groups to a linked list
Replace the current NULL-terminated array of default groups with a linked
list.  This gets rid of lots of nasty code to size and/or dynamically
allocate the array.

While we're at it also provide a conveniant helper to remove the default
groups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Felipe Balbi <balbi@kernel.org>		[drivers/usb/gadget]
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
2016-03-06 16:11:24 +01:00
Al Viro
6c51e513a3 lookup_dcache(): lift d_alloc() into callers
... and kill need_lookup thing

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-05 20:09:32 -05:00
Al Viro
6583fe22d1 do_last(): reorder and simplify a bit
bugger off on negatives a bit earlier, simplify the tests

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-05 18:14:03 -05:00
Al Viro
05ef1c50e7 Merge branch 'for-linus' into work.lookups
for the sake of namei.c fixes
2016-03-05 18:10:51 -05:00