Commit Graph

30831 Commits

Author SHA1 Message Date
Zheng Liu
b3aff3e3f6 ext4: reimplement fiemap using extent status tree
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:37 -05:00
Zheng Liu
7d1b1fbc95 ext4: reimplement ext4_find_delay_alloc_range on extent status tree
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:35 -05:00
Zheng Liu
992e9fdd7b ext4: add some tracepoints in extent status tree
This patch adds some tracepoints in extent status tree.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:33 -05:00
Zheng Liu
51865fda28 ext4: let ext4 maintain extent status tree
This patch lets ext4 maintain extent status tree.

Currently it only tracks delay extent status in extent status tree.  When a
delay allocation is issued, the related delay extent will be inserted into
extent status tree.  When a delay extent is written out or invalidated, it will
be removed from this tree.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:32 -05:00
Zheng Liu
9a26b66175 ext4: initialize extent status tree
Let ext4 initialize extent status tree of an inode.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:30 -05:00
Zheng Liu
654598bef3 ext4: add operations on extent status tree
This patch adds operations on a extent status tree.

CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:20 -05:00
Brian Foster
579b62faa5 xfs: add background scanning to clear eofblocks inodes
Create a new mount workqueue and delayed_work to enable background
scanning and freeing of eofblocks inodes. The scanner kicks in once
speculative preallocation occurs and stops requeueing itself when
no eofblocks inodes exist.

The scan interval is based on the new
'speculative_prealloc_lifetime' tunable (default to 5m). The
background scanner performs unfiltered, best effort scans (which
skips inodes under lock contention or with a dirty cache mapping).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 15:34:59 -06:00
Brian Foster
00ca79a04b xfs: add minimum file size filtering to eofblocks scan
Support minimum file size filtering in the eofblocks scan. The
caller must set the XFS_EOF_FLAGS_MINFILESIZE flags bit and minimum
file size value in bytes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 15:32:29 -06:00
Brian Foster
1b5560488d xfs: support multiple inode id filtering in eofblocks scan
Enhance the eofblocks scan code to filter based on multiply specified
inode id values. When multiple inode id values are specified, only
inodes that match all id values are selected.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 15:31:13 -06:00
Brian Foster
3e3f9f5863 xfs: add inode id filtering to eofblocks scan
Support inode ID filtering in the eofblocks scan. The caller must
set the associated XFS_EOF_FLAGS_*ID bit and ID field.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 15:29:14 -06:00
Brian Foster
8ca149de80 xfs: add XFS_IOC_FREE_EOFBLOCKS ioctl
The XFS_IOC_FREE_EOFBLOCKS ioctl allows users to invoke an EOFBLOCKS
scan. The xfs_eofblocks structure is defined to support the command
parameters (scan mode).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 15:27:49 -06:00
Brian Foster
41176a68e3 xfs: create function to scan and clear EOFBLOCKS inodes
xfs_inodes_free_eofblocks() implements scanning functionality for
EOFBLOCKS inodes. It uses the AG iterator to walk the tagged inodes
and free post-EOF blocks via the xfs_inode_free_eofblocks() execute
function. The scan can be invoked in best-effort mode or wait
(force) mode.

A best-effort scan (default) handles all inodes that do not have a
dirty cache and we successfully acquire the io lock via trylock. In
wait mode, we continue to cycle through an AG until all inodes are
handled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 15:25:40 -06:00
Brian Foster
40165e2761 xfs: make xfs_free_eofblocks() non-static, return EAGAIN on trylock failure
Turn xfs_free_eofblocks() into a non-static function, return EAGAIN to
indicate trylock failure and make sure this error is not propagated in
xfs_release().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 15:24:26 -06:00
Brian Foster
72b53efa4a xfs: create helper to check whether to free eofblocks on inode
This check is used in multiple places to determine whether we
should check for (and potentially free) post EOF blocks on an
inode. Add a helper to consolidate the check.

Note that when we remove an inode from the cache (xfs_inactive()),
we are required to trim post-EOF blocks even if the inode is marked
preallocated or append-only to maintain correct space accounting.
The 'force' parameter to xfs_can_free_eofblocks() specifies whether
we should ignore the prealloc/append-only status of the inode.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 15:22:34 -06:00
Brian Foster
a454f7428f xfs: support a tag-based inode_ag_iterator
Genericize xfs_inode_ag_walk() to support an optional radix tree tag
and args argument for the execute function. Create a new wrapper
called xfs_inode_ag_iterator_tag() that performs a tag based walk
of perag's and inodes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 15:20:38 -06:00
Brian Foster
27b5286792 xfs: add EOFBLOCKS inode tagging/untagging
Add the XFS_ICI_EOFBLOCKS_TAG inode tag to identify inodes with
speculatively preallocated blocks beyond EOF. An inode is tagged
when speculative preallocation occurs and untagged either via
truncate down or when post-EOF blocks are freed via release or
reclaim.

The tag management is intentionally not aggressive to prefer
simplicity over the complexity of handling all the corner cases
under which post-EOF blocks could be freed (i.e., forward
truncation, fallocate, write error conditions, etc.). This means
that a tagged inode may or may not have post-EOF blocks after a
period of time. The tag is eventually cleared when the inode is
released or reclaimed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 14:20:44 -06:00
Zheng Liu
c0677e6d0f ext4: add data structures for the extent status tree
This patch adds two structures that supports extent status tree, extent_status
and ext4_es_tree.  Currently extent_status is used to track a delay extent for
an inode, which record the start block and the length of the delay extent.
ext4_es_tree is used to store all extent_status for an inode in memory.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 15:18:54 -05:00
Lukas Czerner
07aa2ea138 ext4: fix error handling in ext4_fill_super()
There are some places in ext4_fill_super() where we would not return
proper error code if something fails. The confusion is caused probably
due to the fact that we have two "kind-of" return variables 'ret'and
'err'.

'ret' is used to return error code from ext4_fill_super() where err is
used to store return values from other functions within ext4_fill_super().
However some places were missing the obligatory 'ret = err'. We could
put the assignment where it is missing, but we can have better "future
proof" solution. Or we could convert the code to use just one, but it
would require more rewrites.

This commit fixes the problem by returning value from 'err' variable if
it is set and 'ret' otherwise in error handling branch of the
ext4_fill_super(). The reasoning is that 'ret' value is often set to
default "-EINVAL" or explicit value, where 'err' is used to store
return value from other functions and should be otherwise zero.

https://bugzilla.kernel.org/show_bug.cgi?id=48431

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 15:16:54 -05:00
Eugene Shatokhin
24ec19b0ae ext4: fix memory leak in ext4_xattr_set_acl()'s error path
In ext4_xattr_set_acl(), if ext4_journal_start() returns an error,
posix_acl_release() will not be called for 'acl' which may result in a
memory leak.

This patch fixes that.

Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-11-08 15:11:11 -05:00
Anatol Pomozov
8b0f165f79 ext4: remove code duplication in ext4_get_block_write_nolock()
729f52c6be introduced function ext4_get_block_write_nolock() that
is very similar to _ext4_get_block(). Eliminate code duplication
by passing different flags to _ext4_get_block()

Tested: xfs tests

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 15:07:16 -05:00
Anatol Pomozov
8d8c182570 ext4: use 'inode' variable that is already dereferenced
Tested: xfs tests

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 14:53:35 -05:00
Zheng Liu
3779473246 ext4: fix missing call to trace_ext4_ext_map_blocks_exit
When ext4_ext_handle_uninitialized_extents(), we will directly return
from ext4_ext_map_blocks().  The trace point of
trace_ext4_ext_map_blocks_exit isn't called, and the user doesn't see
any result.  This patch tries to fix this problem.

Meanwhile in ext4_ext_handle_uninitialized_extents it returns errors
or the number of allocated blocks.  So 'ret' variable can be removed
due to previously modifications.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
2012-11-08 14:47:52 -05:00
Zheng Liu
19b303d8b5 ext4: print map->m_flags in trace_ext4_ext/ind_map_blocks_exit
When we use trace_ext4_ext/ind_map_blocks_exit, print the value of
map->m_flags in order that we can understand the extent's current
status.

Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 14:34:04 -05:00
Zheng Liu
b5645534ce ext4: print 'flags' in ext4_ext_handle_uninitialized_extents
In trace_ext4_ext_handle_uninitialized_extents we don't care about the
value of map->m_flags because this value is probably 0, and we prefer
to get the value of flags because we can know how to handle this
extent in this function.

Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 14:33:43 -05:00
Lukas Czerner
d71c1ae23a ext4: warn when discard request fails other than EOPNOTSUPP
We should warn user then the discard request fails. However we need to
exclude -EOPNOTSUPP case since parts of the device might not support it
while other parts can. So print the kernel warning when the error !=
-EOPNOTSUPP is returned from ext4_issue_discard().

We should also handle error cases in batched discard, again excluding
EOPNOTSUPP.

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 14:04:52 -05:00
Lukas Czerner
79add3a3f7 ext4: notify when discard is not supported
Notify user when mounting the file system with -o discard option, but
the device does not support discard. Obviously we do not want to fail
the mount or disable the options, because the underlying device might
change in future even without file system remount.

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 13:28:29 -05:00
Alan Cox
d8ec0c3960 ext4: remove unused assignment
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 12:19:58 -05:00
Dave Chinner
6ce377afd1 xfs: fix reading of wrapped log data
Commit 4439647 ("xfs: reset buffer pointers before freeing them") in
3.0-rc1 introduced a regression when recovering log buffers that
wrapped around the end of log. The second part of the log buffer at
the start of the physical log was being read into the header buffer
rather than the data buffer, and hence recovery was seeing garbage
in the data buffer when it got to the region of the log buffer that
was incorrectly read.

Cc: <stable@vger.kernel.org> # 3.0.x, 3.2.x, 3.4.x 3.6.x
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 11:10:51 -06:00
Dave Chinner
03b1293eda xfs: fix buffer shudown reference count mismatch
When we shut down the filesystem, we have to unpin and free all the
buffers currently active in the CIL. To do this we unpin and remove
them in one operation as a result of a failed iclogbuf write. For
buffers, we do this removal via a simultated IO completion of after
marking the buffer stale.

At the time we do this, we have two references to the buffer - the
active LRU reference and the buf log item.  The LRU reference is
removed by marking the buffer stale, and the active CIL reference is
by the xfs_buf_iodone() callback that is run by
xfs_buf_do_callbacks() during ioend processing (via the bp->b_iodone
callback).

However, ioend processing requires one more reference - that of the
IO that it is completing. We don't have this reference, so we free
the buffer prematurely and use it after it is freed. For buffers
marked with XBF_ASYNC, this leads to assert failures in
xfs_buf_rele() on debug kernels because the b_hold count is zero.

Fix this by making sure we take the necessary IO reference before
starting IO completion processing on the stale buffer, and set the
XBF_ASYNC flag to ensure that IO completion processing removes all
the active references from the buffer to ensure it is fully torn
down.

Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 11:10:35 -06:00
Dave Chinner
4b62acfe99 xfs: don't vmap inode cluster buffers during free
Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
default behaviour").

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 11:10:18 -06:00
Dave Chinner
ca250b1b3d xfs: invalidate allocbt blocks moved to the free list
When we free a block from the alloc btree tree, we move it to the
freelist held in the AGFL and mark it busy in the busy extent tree.
This typically happens when we merge btree blocks.

Once the transaction is committed and checkpointed, the block can
remain on the free list for an indefinite amount of time.  Now, this
isn't the end of the world at this point - if the free list is
shortened, the buffer is invalidated in the transaction that moves
it back to free space. If the buffer is allocated as metadata from
the free list, then all the modifications getted logged, and we have
no issues, either. And if it gets allocated as userdata direct from
the freelist, it gets invalidated and so will never get written.

However, during the time it sits on the free list, pressure on the
log can cause the AIL to be pushed and the buffer that covers the
block gets pushed for write. IOWs, we end up writing a freed
metadata block to disk. Again, this isn't the end of the world
because we know from the above we are only writing to free space.

The problem, however, is for validation callbacks. If the block was
on old btree root block, then the level of the block is going to be
higher than the current tree root, and so will fail validation.
There may be other inconsistencies in the block as well, and
currently we don't care because the block is in free space. Shutting
down the filesystem because a freed block doesn't pass write
validation, OTOH, is rather unfriendly.

So, make sure we always invalidate buffers as they move from the
free space trees to the free list so that we guarantee they never
get written to disk while on the free list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phil White <pwhite@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 11:09:44 -06:00
Dave Chinner
1e7acbb7bc xfs: silence uninitialised f.file warning.
Uninitialised variable build warning introduced by 2903ff0 ("switch
simple cases of fget_light to fdget"), gcc is not smart enough to
work out that the variable is not used uninitialised, and the commit
removed the initialisation at declaration that the old variable had.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 11:09:17 -06:00
Dave Chinner
eaef854335 xfs: growfs: don't read garbage for new secondary superblocks
When updating new secondary superblocks in a growfs operation, the
superblock buffer is read from the newly grown region of the
underlying device. This is not guaranteed to be zero, so violates
the underlying assumption that the unused parts of superblocks are
zero filled. Get a new buffer for these secondary superblocks to
ensure that the unused regions are zero filled correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 11:08:57 -06:00
Dave Chinner
1f3c785c3a xfs: move allocation stack switch up to xfs_bmapi_allocate
Switching stacks are xfs_alloc_vextent can cause deadlocks when we
run out of worker threads on the allocation workqueue. This can
occur because xfs_bmap_btalloc can make multiple calls to
xfs_alloc_vextent() and even if xfs_alloc_vextent() fails it can
return with the AGF locked in the current allocation transaction.

If we then need to make another allocation, and all the allocation
worker contexts are exhausted because the are blocked waiting for
the AGF lock, holder of the AGF cannot get it's xfs-alloc_vextent
work completed to release the AGF.  Hence allocation effectively
deadlocks.

To avoid this, move the stack switch one layer up to
xfs_bmapi_allocate() so that all of the allocation attempts in a
single switched stack transaction occur in a single worker context.
This avoids the problem of an allocation being blocked waiting for
a worker thread whilst holding the AGF.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 11:08:46 -06:00
Dave Chinner
326c03555b xfs: introduce XFS_BMAPI_STACK_SWITCH
Certain allocation paths through xfs_bmapi_write() are in situations
where we have limited stack available. These are almost always in
the buffered IO writeback path when convertion delayed allocation
extents to real extents.

The current stack switch occurs for userdata allocations, which
means we also do stack switches for preallocation, direct IO and
unwritten extent conversion, even those these call chains have never
been implicated in a stack overrun.

Hence, let's target just the single stack overun offended for stack
switches. To do that, introduce a XFS_BMAPI_STACK_SWITCH flag that
the caller can pass xfs_bmapi_write() to indicate it should switch
stacks if it needs to do allocation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 11:08:27 -06:00
Mark Tinguely
408cc4e97a xfs: zero allocation_args on the kernel stack
Zero the kernel stack space that makes up the xfs_alloc_arg structures.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 11:08:10 -06:00
Dave Chinner
7e9620f21d xfs: only update the last_sync_lsn when a transaction completes
The log write code stamps each iclog with the current tail LSN in
the iclog header so that recovery knows where to find the tail of
thelog once it has found the head. Normally this is taken from the
first item on the AIL - the log item that corresponds to the oldest
active item in the log.

The problem is that when the AIL is empty, the tail lsn is dervied
from the the l_last_sync_lsn, which is the LSN of the last iclog to
be written to the log. In most cases this doesn't happen, because
the AIL is rarely empty on an active filesystem. However, when it
does, it opens up an interesting case when the transaction being
committed to the iclog spans multiple iclogs.

That is, the first iclog is stamped with the l_last_sync_lsn, and IO
is issued. Then the next iclog is setup, the changes copied into the
iclog (takes some time), and then the l_last_sync_lsn is stamped
into the header and IO is issued. This is still the same
transaction, so the tail lsn of both iclogs must be the same for log
recovery to find the entire transaction to be able to replay it.

The problem arises in that the iclog buffer IO completion updates
the l_last_sync_lsn with it's own LSN. Therefore, If the first iclog
completes it's IO before the second iclog is filled and has the tail
lsn stamped in it, it will stamp the LSN of the first iclog into
it's tail lsn field. If the system fails at this point, log recovery
will not see a complete transaction, so the transaction will no be
replayed.

The fix is simple - the l_last_sync_lsn is updated when a iclog
buffer IO completes, and this is incorrect. The l_last_sync_lsn
shoul dbe updated when a transaction is completed by a iclog buffer
IO. That is, only iclog buffers that have transaction commit
callbacks attached to them should update the l_last_sync_lsn. This
means that the last_sync_lsn will only move forward when a commit
record it written, not in the middle of a large transaction that is
rolling through multiple iclog buffers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-11-08 11:07:38 -06:00
Zhao Hongjiang
d339450cca ext4: get rid of redundant code in ext4_fill_super()
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 12:07:33 -05:00
Eric Sandeen
37be2f59d3 ext4: remove ext4_handle_release_buffer()
ext4_handle_release_buffer() was intended to remove journal
write access from a buffer, but it doesn't actually do anything
at all other than add a BUFFER_TRACE point, but it's not reliably
used for that either.  Remove all the associated dead code.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2012-11-08 11:22:46 -05:00
Eric Sandeen
6d138ced75 ext4: fix awful goto in ext4_mb_new_blocks()
I think the whole function could be made prettier, but
that goto really took the cake for too-clever-by-half.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 11:11:59 -05:00
Eric Sandeen
b72f78cb63 ext4: fix overhead calculations in ext4_stats, again
"overhead" was a write-only variable in this function after commit
952fc18e; we set it to 0 for minixdf, or to sbi->s_overhead if !minixdf,
but never read it again after that.

We need to use it, not sbi->s_overhead, when subtracting out overhead
for f_blocks, or we get the wrong answer for minixdf.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 10:33:36 -05:00
Li Wang
e4bc6522d5 eCryptfs: Avoid unnecessary disk read and data decryption during writing
ecryptfs_write_begin grabs a page from page cache for writing.
If the page contains invalid data, or data older than the
counterpart on the disk, eCryptfs will read out the
corresponing data from the disk into the page, decrypt them,
then perform writing. However, for this page, if the length
of the data to be written into is equal to page size,
that means the whole page of data will be overwritten,
in which case, it does not matter whatever the data were before,
it is beneficial to perform writing directly rather than bothering
to read and decrypt first.

With this optimization, according to our test on a machine with
Intel Core 2 Duo processor, iozone 'write' operation on an existing
file with write size being multiple of page size will enjoy a steady
3x speedup.

Signed-off-by: Li Wang <wangli@kylinos.com.cn>
Signed-off-by: Yunchuan Wen <wenyunchuan@kylinos.com.cn>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2012-11-07 17:56:16 -08:00
J. Bruce Fields
12fc3e92d4 nfsd4: backchannel should use client-provided security flavor
For now this only adds support for AUTH_NULL.  (Previously we assumed
AUTH_UNIX.)  We'll also need AUTH_GSS, which is trickier.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-07 19:40:05 -05:00
J. Bruce Fields
57725155dc nfsd4: common helper to initialize callback work
I've found it confusing having the only references to
nfsd4_do_callback_rpc() in a different file.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-07 19:40:04 -05:00
J. Bruce Fields
cb73a9f464 nfsd4: implement backchannel_ctl operation
This operation is mandatory for servers to implement.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-07 19:39:58 -05:00
J. Bruce Fields
c6bb3ca27d nfsd4: use callback security parameters in create_session
We're currently ignoring the callback security parameters specified in
create_session, and just assuming the client wants auth_sys, because
that's all the current linux client happens to care about.  But this
could cause us callbacks to fail to a client that wanted something
different.

For now, all we're doing is no longer ignoring the uid and gid passed in
the auth_sys case.  Further patches will add support for auth_null and
gss (and possibly use more of the auth_sys information; the spec wants
us to use exactly the credential we're passed, though it's hard to
imagine why a client would care).

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-07 19:31:35 -05:00
J. Bruce Fields
acb2887e04 nfsd4: clean up callback security parsing
Move the callback parsing into a separate function.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-07 19:31:35 -05:00
J. Bruce Fields
face15025f nfsd: use vfs_fsync_range(), not O_SYNC, for stable writes
NFSv4 shares the same struct file across multiple writes.  (And we'd
like NFSv2 and NFSv3 to do that as well some day.)

So setting O_SYNC on the struct file as a way to request a synchronous
write doesn't work.

Instead, do a vfs_fsync_range() in that case.

Reported-by: Peter Staubach <pstaubach@exagrid.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-07 19:31:34 -05:00
J. Bruce Fields
fae5096ad2 nfsd: assume writeable exportabled filesystems have f_sync
I don't really see how you could claim to support nfsd and not support
fsync somehow.

And in practice a quick look through the exportable filesystems suggests
the only ones without an ->fsync are read-only (efs, isofs, squashfs) or
in-memory (shmem).

Also, performing a write and then returning an error if the sync fails
(as we would do here in the wgather case) seems unhelpful to clients.

Also remove an incorrect comment.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-07 19:31:33 -05:00
J. Bruce Fields
7fa10cd12d nfsd4: don't BUG in delegation break callback
These conditions would indeed indicate bugs in the code, but if we want
to hear about them we're likely better off warning and returning than
immediately dying while holding file_lock_lock.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-11-07 19:31:33 -05:00