Commit Graph

48450 Commits

Author SHA1 Message Date
Mike Marshall
9f08cfe944 Orangefs: update orangefs.txt
Al Viro has cleaned up the way ops are processed and waited for,
now orangefs.txt has an overview of how it works. Several recent
related commits have added to the comments in the code as well.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-26 14:39:08 -05:00
Mike Marshall
ca9f518ead Orangefs: code sanitation.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-26 10:21:12 -05:00
Arnd Bergmann
401898eed7 orangefs: remove unused 'diff' function
orangefs contains a helper function to calculate the difference
between two timeval structures. We are trying to remove all
instances of timespec from the kernel, and this one is not
used at all, so let's remove it now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-26 10:18:43 -05:00
Arnd Bergmann
be81ce48b2 orangefs: avoid time conversion function
The new orangefs code uses a helper function to read a time field to
its private structures from struct iattr. This will conflict with the
move to 64-bit timestamps in the kernel and is generally not necessary.

This replaces the conversion with a simple cast to time64_t that shows
what is going on. As the orangefs-internal representation already uses
64-bit timestamps, there should be no ambiguity to negative values,
and the cast ensures that we treat them as times before 1970 on both
32-bit and 64-bit architectures, rather than times after 2038. This
patch keeps that behavior.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-26 10:18:39 -05:00
David Sterba
f5bc27c71a Merge branch 'dev/control-ioctl' into for-chris-4.6 2016-02-26 15:38:34 +01:00
David Sterba
fa695b01bc Merge branch 'misc-4.6' into for-chris-4.6
# Conflicts:
#	fs/btrfs/file.c
2016-02-26 15:38:34 +01:00
David Sterba
f004fae0cf Merge branch 'cleanups-4.6' into for-chris-4.6 2016-02-26 15:38:33 +01:00
David Sterba
675d276b32 Merge branch 'foreign/liubo/replace-lockup' into for-chris-4.6 2016-02-26 15:38:32 +01:00
David Sterba
e9ddd77a31 Merge branch 'foreign/josef/space-updates' into for-chris-4.6 2016-02-26 15:38:31 +01:00
David Sterba
ff7db6e05a Merge branch 'foreign/zhaolei/reada' into for-chris-4.6 2016-02-26 15:38:30 +01:00
David Sterba
23c1a966f2 Merge branch 'foreign/qu/norecovery-v7' into for-chris-4.6 2016-02-26 15:38:30 +01:00
David Sterba
67d605fec1 Merge branch 'dev/rename-keys' into for-chris-4.6 2016-02-26 15:38:29 +01:00
David Sterba
e22b3d1fbe Merge branch 'dev/gfp-flags' into for-chris-4.6 2016-02-26 15:38:28 +01:00
David Sterba
5f1b5664d9 Merge branch 'chandan/prep-subpage-blocksize' into for-chris-4.6
# Conflicts:
#	fs/btrfs/file.c
2016-02-26 15:38:28 +01:00
Deepa Dinamani
b1f1a29d8f configfs: Replace CURRENT_TIME by current_fs_time()
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-02-26 10:42:35 +01:00
Chao Yu
80dd9c0e9d f2fs: fix incorrect upper bound when iterating inode mapping tree
1. Inode mapping tree can index page in range of [0, ULONG_MAX], however,
in some places, f2fs only search or iterate page in ragne of [0, LONG_MAX],
result in miss hitting in page cache.

2. filemap_fdatawait_range accepts range parameters in unit of bytes, so
the max range it covers should be [0, LLONG_MAX], if we use [0, LONG_MAX]
as range for waiting on writeback, big number of pages will not be covered.

This patch corrects above two issues.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-25 17:27:03 -08:00
David Woodhouse
be629c62a6 Fix directory hardlinks from deleted directories
When a directory is deleted, we don't take too much care about killing off
all the dirents that belong to it — on the basis that on remount, the scan
will conclude that the directory is dead anyway.

This doesn't work though, when the deleted directory contained a child
directory which was moved *out*. In the early stages of the fs build
we can then end up with an apparent hard link, with the child directory
appearing both in its true location, and as a child of the original
directory which are this stage of the mount process we don't *yet* know
is defunct.

To resolve this, take out the early special-casing of the "directories
shall not have hard links" rule in jffs2_build_inode_pass1(), and let the
normal nlink processing happen for directories as well as other inodes.

Then later in the build process we can set ic->pino_nlink to the parent
inode#, as is required for directories during normal operaton, instead
of the nlink. And complain only *then* about hard links which are still
in evidence even after killing off all the unreachable paths.

Reported-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@vger.kernel.org
2016-02-25 11:11:28 +00:00
David Woodhouse
49e91e7079 jffs2: Fix page lock / f->sem deadlock
With this fix, all code paths should now be obtaining the page lock before
f->sem.

Reported-by: Szabó Tamás <sztomi89@gmail.com>
Tested-by: Thomas Betker <thomas.betker@rohde-schwarz.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@vger.kernel.org
2016-02-25 11:11:26 +00:00
Thomas Betker
157078f64b Revert "jffs2: Fix lock acquisition order bug in jffs2_write_begin"
This reverts commit 5ffd3412ae
("jffs2: Fix lock acquisition order bug in jffs2_write_begin").

The commit modified jffs2_write_begin() to remove a deadlock with
jffs2_garbage_collect_live(), but this introduced new deadlocks found
by multiple users. page_lock() actually has to be called before
mutex_lock(&c->alloc_sem) or mutex_lock(&f->sem) because
jffs2_write_end() and jffs2_readpage() are called with the page locked,
and they acquire c->alloc_sem and f->sem, resp.

In other words, the lock order in jffs2_write_begin() was correct, and
it is the jffs2_garbage_collect_live() path that has to be changed.

Revert the commit to get rid of the new deadlocks, and to clear the way
for a better fix of the original deadlock.

Reported-by: Deng Chao <deng.chao1@zte.com.cn>
Reported-by: Ming Liu <liu.ming50@gmail.com>
Reported-by: wangzaiwei <wangzaiwei@top-vision.cn>
Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@vger.kernel.org
2016-02-25 11:11:25 +00:00
Martin Brandenburg
69a23de2f3 orangefs: clean up fill_default_sys_attrs
Size and type are read-only and not in the mask. The times were left
unset despite being in the mask.

We zero-fill the times since the server will fill them in and we will
get the correct time when we fill the inode with getattr.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-24 17:07:51 -05:00
Martin Brandenburg
6ceaf7818f orangefs: we never lookup with sym_follow set
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-24 17:07:51 -05:00
Martin Brandenburg
9c2bcf288e orangefs: remove vestigial async io code
I have verified that there is nothing in the userspace daemon version we
are implementing this protocol against that ever looks at this field.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-24 17:07:50 -05:00
Martin Brandenburg
47b4948fdb orangefs: use ORANGEFS_NAME_LEN everywhere; remove ORANGEFS_NAME_MAX
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-24 17:07:50 -05:00
Martin Brandenburg
ee70fca0bc orangefs: don't d_drop in d_revalidate since the caller will
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-24 17:07:50 -05:00
Martin Brandenburg
ee3b8d377c orangefs: free readdir buffer index before the dir_emit loop
We only need it while the service operation is actually in progress
since it is only used to co-ordinate the client-core's memory use. The
kernel allocates its own space.

Also clean up some comments which mislead the reader into thinking
the readdir buffers are shared memory.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-24 17:07:50 -05:00
Linus Torvalds
aa263c43fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes - xattr one from this cycle, the rest - stable fodder"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/pnode.c: treat zero mnt_group_id-s as unequal
  affs_do_readpage_ofs(): just use kmap_atomic() around memcpy()
  xattr handlers: plug a lock leak in simple_xattr_list
  fs: allow no_seek_end_llseek to actually seek
2016-02-24 14:00:26 -08:00
Mike Marshall
adcf34a289 Orangefs: code sanitation
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-24 16:54:27 -05:00
Mike Marshall
d37c0f307a Orangefs: clean up orangefs_kernel_op_s comments.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-24 13:24:14 -05:00
Linus Torvalds
420eb6d7ef Merge tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
 "Stable bugfixes:
   - Fix nfs_size_to_loff_t
   - NFSv4: Fix a dentry leak on alias use

  Other bugfixes:
   - Don't schedule a layoutreturn if the layout segment can be freed
     immediately.
   - Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
   - rpcrdma_bc_receive_call() should init rq_private_buf.len
   - fix stateid handling for the NFS v4.2 operations
   - pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
   - fix panic in gss_pipe_downcall() in fips mode
   - Fix a race between layoutget and pnfs_destroy_layout
   - Fix a race between layoutget and bulk recalls"

* tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.x/pnfs: Fix a race between layoutget and bulk recalls
  NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layout
  auth_gss: fix panic in gss_pipe_downcall() in fips mode
  pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page
  nfs4: fix stateid handling for the NFS v4.2 operations
  NFSv4: Fix a dentry leak on alias use
  xprtrdma: rpcrdma_bc_receive_call() should init rq_private_buf.len
  pNFS: Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode
  pNFS: Fix pnfs_mark_matching_lsegs_return()
  nfs: fix nfs_size_to_loff_t
2016-02-23 16:39:21 -08:00
Yunlei He
0ff21646f2 f2fs: avoid hungtask problem caused by losing wake_up
The D state of wait_on_all_pages_writeback should be waken by
function f2fs_write_end_io when all writeback pages have been
succesfully written to device. It's possible that wake_up comes
between get_pages and io_schedule. Maybe in this case it will
lost wake_up and still in D state even if all pages have been
write back to device, and finally, the whole system will be into
the hungtask state.

                if (!get_pages(sbi, F2FS_WRITEBACK))
                         break;
					<---------  wake_up
                io_schedule();

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Biao He <hebiao6@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-23 10:10:09 -08:00
Liu Bo
73beece9ca Btrfs: fix lockdep deadlock warning due to dev_replace
Xfstests btrfs/011 complains about a deadlock warning,

[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

[ 1226.652955]  Possible interrupt unsafe locking scenario:

[ 1226.652955]        CPU0                    CPU1
[ 1226.652955]        ----                    ----
[ 1226.652955]   lock(&fs_info->dev_replace.lock);
[ 1226.652955]                                local_irq_disable();
[ 1226.652955]                                lock(&delayed_node->mutex);
[ 1226.652955]                                lock(&found->groups_sem);
[ 1226.652955]   <Interrupt>
[ 1226.652955]     lock(&delayed_node->mutex);
[ 1226.652955]
 *** DEADLOCK ***

Commit 084b6e7c76 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar one that has the exactly same warning, but with that, we still
run to this.

The above lock chain comes from
btrfs_commit_transaction
  ->btrfs_run_delayed_items
    ...
    ->__btrfs_update_delayed_inode
      ...
      ->__btrfs_cow_block
         ...
         ->find_free_extent
            ->cache_block_group
              ->load_free_space_cache
                ->btrfs_readpages
                  ->submit_one_bio
                    ...
                    ->__btrfs_map_block
                      ->btrfs_dev_replace_lock

However, with high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd and then kswapd is intended to release memory occupied
by superblock, inodes and dentries, where we may call evict_inode, and it comes
to

[ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700

delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to a ABBA deadlock.

To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
things are simpler here since we only needs read's spinlock to blocking lock.

With this, btrfs/011 no more produces warnings in dmesg.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 13:10:10 +01:00
David Sterba
d5131b658c btrfs: drop unused argument in btrfs_ioctl_get_supported_features
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 12:56:35 +01:00
David Sterba
c5868f8362 btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls
The control device is accessible when no filesystem is mounted and we
may want to query features supported by the module. This is already
possible using the sysfs files, this ioctl is for parity and
convenience.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 12:56:21 +01:00
David Sterba
f7e98a7fff btrfs: change max_inline default to 2048
The current practical default is ~4k on x86_64 (the logic is more complex,
simplified for brevity), the inlined files land in the metadata group and
thus consume space that could be needed for the real metadata.

The inlining brings some usability surprises:

1) total space consumption measured on various filesystems and btrfs
   with DUP metadata was quite visible because of the duplicated data
   within metadata

2) inlined data may exhaust the metadata, which are more precious in case
   the entire device space is allocated to chunks (ie. balance cannot
   make the space more compact)

3) performance suffers a bit as the inlined blocks are duplicate and
   stored far away on the device.

Proposed fix: set the default to 2048

This fixes namely 1), the total filesysystem space consumption will be on
par with other filesystems.

Partially fixes 2), more data are pushed to the data block groups.

The characteristics of 3) are based on actual small file size
distribution.

The change is independent of the metadata blockgroup type (though it's
most visible with DUP) or system page size as these parameters are not
trival to find out, compared to file size.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 12:55:27 +01:00
David Sterba
11ea474f74 btrfs: remove error message from search ioctl for nonexistent tree
Let's remove the error message that appears when the tree_id is not
present. This can happen with the quota tree and has been observed in
practice. The applications are supposed to handle -ENOENT and we don't
need to report that in the system log as it's not a fatal error.

Reported-by: Vlastimil Babka <vbabka@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 12:54:48 +01:00
Arnd Bergmann
f827ba9a64 btrfs: avoid uninitialized variable warning
With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides
to partially inline the get_state_failrec() function but cannot
figure out that means the failrec pointer is always valid
if the function returns success, which causes a harmless
warning:

fs/btrfs/extent_io.c: In function 'clean_io_failure':
fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This marks get_state_failrec() and set_state_failrec() both
as 'noinline', which avoids the warning in all cases for me,
and seems less ugly than adding a fake initialization.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 47dc196ae7 ("btrfs: use proper type for failrec in extent_state")
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 12:42:46 +01:00
Chao Yu
7a9d75481b f2fs: trace old block address for CoWed page
This patch enables to trace old block address of CoWed page for better
debugging.

f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f0, oldaddr = 0xfe8ab, newaddr = 0xfee90 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f8, oldaddr = 0xfe8b0, newaddr = 0xfee91 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4fa, oldaddr = 0xfe8ae, newaddr = 0xfee92 rw = WRITE_SYNC, type = NODE

f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x96, oldaddr = 0xf049b, newaddr = 0x2bbe rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x97, oldaddr = 0xf049c, newaddr = 0x2bbf rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x98, oldaddr = 0xf049d, newaddr = 0x2bc0 rw = WRITE, type = DATA

f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x47, oldaddr = 0xffffffff, newaddr = 0xf2631 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x48, oldaddr = 0xffffffff, newaddr = 0xf2632 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x49, oldaddr = 0xffffffff, newaddr = 0xf2633 rw = WRITE, type = DATA

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:40:02 -08:00
Chao Yu
9a4cbc9e53 f2fs: try to flush inode after merging inline data
When flushing node pages, if current node page is an inline inode page, we
will try to merge inline data from data page into inline inode page, then
skip flushing current node page, it will decrease the number of nodes to
be flushed in batch in this round, which may lead to worse performance.

This patch gives a chance to flush just merged inline inode pages for
performance.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:40:02 -08:00
Chao Yu
41214b3c77 f2fs: show more info about superblock recovery
This patch changes to show more info in message log about the recovery
of the corrupted superblock during ->mount, e.g. the index of corrupted
superblock and the result of recovery.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:40:01 -08:00
Chao Yu
17d899df46 f2fs: fix the wrong stat count of calling gc
With a partition which was formated as multi segments in one section,
we stated incorrectly for count of gc operation.

e.g., for a partition with segs_per_sec = 4

cat /sys/kernel/debug/f2fs/status

GC calls: 208 (BG: 7)
  - data segments : 104 (52)
  - node segments : 104 (24)

GC called count should be (104 (data segs) + 104 (node segs)) / 4 = 52,
rather than 208. Fix it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:40:00 -08:00
Jaegeuk Kim
4ce537763e f2fs: remain last victim segment number ascending order
This patch avoids to remain inefficient victim segment number selected by
a victim.

For example, if all the dirty segments has same valid blocks, we can get
the victim segments descending order due to keeping wrong last segment number.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:39:59 -08:00
Shawn Lin
8060656aa3 f2fs: reuse read_inline_data for f2fs_convert_inline_page
f2fs_convert_inline_page introduce what read_inline_data
already does for copying out the inline data from inode_page.
We can use read_inline_data instead to simplify the code.

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:39:59 -08:00
Chao Yu
993a04996f f2fs: fix to delete old dirent in converted inline directory in ->rename
When doing test with fstests/generic/068 in inline_dentry enabled f2fs,
following oops dmesg will be reported:

 ------------[ cut here ]------------
 WARNING: CPU: 5 PID: 11841 at fs/inode.c:273 drop_nlink+0x49/0x50()
 Modules linked in: f2fs(O) ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
 CPU: 5 PID: 11841 Comm: fsstress Tainted: G           O    4.5.0-rc1 #45
 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.61 05/16/2013
  0000000000000111 ffff88009cdf7ae8 ffffffff813e5944 0000000000002e41
  0000000000000000 0000000000000111 0000000000000000 ffff88009cdf7b28
  ffffffff8106a587 ffff88009cdf7b58 ffff8804078fe180 ffff880374a64e00
 Call Trace:
  [<ffffffff813e5944>] dump_stack+0x48/0x64
  [<ffffffff8106a587>] warn_slowpath_common+0x97/0xe0
  [<ffffffff8106a5ea>] warn_slowpath_null+0x1a/0x20
  [<ffffffff81231039>] drop_nlink+0x49/0x50
  [<ffffffffa07b95b4>] f2fs_rename2+0xe04/0x10c0 [f2fs]
  [<ffffffff81231ff1>] ? lock_two_nondirectories+0x81/0x90
  [<ffffffff813f454d>] ? lockref_get+0x1d/0x30
  [<ffffffff81220f70>] vfs_rename+0x2e0/0x640
  [<ffffffff8121f9db>] ? lookup_dcache+0x3b/0xd0
  [<ffffffff810b8e41>] ? update_fast_ctr+0x21/0x40
  [<ffffffff8134ff12>] ? security_path_rename+0xa2/0xd0
  [<ffffffff81224af6>] SYSC_renameat2+0x4b6/0x540
  [<ffffffff810ba8ed>] ? trace_hardirqs_off+0xd/0x10
  [<ffffffff810022ba>] ? exit_to_usermode_loop+0x7a/0xd0
  [<ffffffff817e0ade>] ? int_ret_from_sys_call+0x52/0x9f
  [<ffffffff810bdc90>] ? trace_hardirqs_on_caller+0x100/0x1c0
  [<ffffffff81224b8e>] SyS_renameat2+0xe/0x10
  [<ffffffff8121f08e>] SyS_rename+0x1e/0x20
  [<ffffffff817e0957>] entry_SYSCALL_64_fastpath+0x12/0x6f
 ---[ end trace 2b31e17995404e42 ]---

This is because: in the same inline directory, when we renaming one file
from source name to target name which is not existed, once space of inline
dentry is not enough, inline conversion will be triggered, after that all
data in inline dentry will be moved to normal dentry page.

After attaching the new entry in coverted dentry page, still we try to
remove old entry in original inline dentry, since old entry has been
moved, so it obviously doesn't make any effect, result in remaining old
entry in converted dentry page.

Now, we have two valid dentries pointed to the same inode which has nlink
value of 1, deleting them both, above warning appears.

This issue can be reproduced easily as below steps:
1. mount f2fs with inline_dentry option
2. mkdir dir
3. touch 180 files named [001-180] in dir
4. rename dir/180 dir/181
5. rm dir/180 dir/181

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:39:58 -08:00
Chao Yu
9def1e9216 f2fs: detect error of update_dent_inode in ->rename
Should check and show correct return value of update_dent_inode in
->rename.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:39:57 -08:00
Shawn Lin
984ec63c5a f2fs: move sanity checking of cp into get_valid_checkpoint
>From the function name of get_valid_checkpoint, it seems to return
the valid cp or NULL for caller to check. If no valid one is found,
f2fs_fill_super will print the err log. But if get_valid_checkpoint
get one valid(the return value indicate that it's valid, however actually
it is invalid after sanity checking), then print another similar err
log. That seems strange. Let's keep sanity checking inside the procedure
of geting valid cp. Another improvement we gained from this move is
that even the large volume is supported, we check the cp in advanced
to skip the following procedure if failing the sanity checking.

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:39:56 -08:00
Shawn Lin
2b39e9072d f2fs: slightly reorganize read_raw_super_block
read_raw_super_block was introduced to help find the
first valid superblock. Commit da554e48ca ("f2fs:
recovering broken superblock during mount") changed the
behaviour to read both of them and check whether need
the recovery flag or not. So the comment before this
function isn't consistent with what it actually does.
Also, the origin code use two tags to round the err
cases, which isn't so readable. So this patch amend
the comment and slightly reorganize it.

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:39:56 -08:00
Chao Yu
1515aef013 f2fs: reorder nat cache lock in cache_nat_entry
When lookuping nat entry in cache_nat_entry, if we fail to hit nat cache,
we try to load nat entries a) from journal of current segment cache or b)
from NAT pages for updating, during the process, write lock of
nat_tree_lock will be held to avoid inconsistent condition in between
nid cache and nat cache caused by racing among nat entry shrinker,
checkpointer, nat entry updater.

But this way may cause low efficient when updating nat cache, because it
serializes accessing in journal cache or reading NAT pages.

Here, we reorder lock and update flow as below to enhance accessing
concurrency:

 - get_node_info
  - down_read(nat_tree_lock)
  - lookup nat cache --- hit -> unlock & return
  - lookup journal cache --- hit -> unlock & goto update
  - up_read(nat_tree_lock)
update:
  - down_write(nat_tree_lock)
  - cache_nat_entry
   - lookup nat cache --- nohit -> update
  - up_write(nat_tree_lock)

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:39:55 -08:00
Chao Yu
b7ad7512b8 f2fs: split journal cache from curseg cache
In curseg cache, f2fs caches two different parts:
 - datas of current summay block, i.e. summary entries, footer info.
 - journal info, i.e. sparse nat/sit entries or io stat info.

With this approach, 1) it may cause higher lock contention when we access
or update both of the parts of cache since we use the same mutex lock
curseg_mutex to protect the cache. 2) current summary block with last
journal info will be writebacked into device as a normal summary block
when flushing, however, we treat journal info as valid one only in current
summary, so most normal summary blocks contain junk journal data, it wastes
remaining space of summary block.

So, in order to fix above issues, we split curseg cache into two parts:
a) current summary block, protected by original mutex lock curseg_mutex
b) journal cache, protected by newly introduced r/w semaphore journal_rwsem

When loading curseg cache during ->mount, we store summary info and
journal info into different caches; When doing checkpoint, we combine
datas of two cache into current summary block for persisting.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:39:54 -08:00
Chao Yu
e9f5b8b8d6 f2fs: enhance IO path with block plug
Try to use block plug in more place as below to let process cache bios
as much as possbile, in order to reduce lock overhead of queue in IO
scheduler.
1) sync_meta_pages
2) ra_meta_pages
3) f2fs_balance_fs_bg

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:39:54 -08:00
Chao Yu
dfc08a12e4 f2fs: introduce f2fs_journal struct to wrap journal info
Introduce a new structure f2fs_journal to wrap journal info in struct
f2fs_summary_block for readability.

struct f2fs_journal {
	union {
		__le16 n_nats;
		__le16 n_sits;
	};
	union {
		struct nat_journal nat_j;
		struct sit_journal sit_j;
		struct f2fs_extra_info info;
	};
} __packed;

struct f2fs_summary_block {
	struct f2fs_summary entries[ENTRIES_IN_SUM];
	struct f2fs_journal journal;
	struct summary_footer footer;
} __packed;

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-02-22 21:39:53 -08:00