Pul xfs updates from Dave Chinner:
"There's a couple of small API changes to the core DAX code which
required small changes to the ext2 and ext4 code bases, but otherwise
everything is within the XFS codebase.
This update contains:
- A new sparse on-disk inode record format to allow small extents to
be used for inode allocation when free space is fragmented.
- DAX support. This includes minor changes to the DAX core code to
fix problems with lock ordering and bufferhead mapping abuse.
- transaction commit interface cleanup
- removal of various unnecessary XFS specific type definitions
- cleanup and optimisation of freelist preparation before allocation
- various minor cleanups
- bug fixes for
- transaction reservation leaks
- incorrect inode logging in unwritten extent conversion
- mmap lock vs freeze ordering
- remote symlink mishandling
- attribute fork removal issues"
* tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
xfs: don't truncate attribute extents if no extents exist
xfs: clean up XFS_MIN_FREELIST macros
xfs: sanitise error handling in xfs_alloc_fix_freelist
xfs: factor out free space extent length check
xfs: xfs_alloc_fix_freelist() can use incore perag structures
xfs: remove xfs_caddr_t
xfs: use void pointers in log validation helpers
xfs: return a void pointer from xfs_buf_offset
xfs: remove inst_t
xfs: remove __psint_t and __psunsigned_t
xfs: fix remote symlinks on V5/CRC filesystems
xfs: fix xfs_log_done interface
xfs: saner xfs_trans_commit interface
xfs: remove the flags argument to xfs_trans_cancel
xfs: pass a boolean flag to xfs_trans_free_items
xfs: switch remaining xfs_trans_dup users to xfs_trans_roll
xfs: check min blks for random debug mode sparse allocations
xfs: fix sparse inodes 32-bit compile failure
xfs: add initial DAX support
xfs: add DAX IO path support
...
Pull btrfs updates from Chris Mason:
"Outside of our usual batch of fixes, this integrates the subvolume
quota updates that Qu Wenruo from Fujitsu has been working on for a
few releases now. He gets an extra gold star for making btrfs smaller
this time, and fixing a number of quota corners in the process.
Dave Sterba tested and integrated Anand Jain's sysfs improvements.
Outside of exporting a symbol (ack'd by Greg) these are all internal
to btrfs and it's mostly cleanups and fixes. Anand also attached some
of our sysfs objects to our internal device management structs instead
of an object off the super block. It will make device management
easier overall and it's a better fit for how the sysfs files are used.
None of the existing sysfs files are moved around.
Thanks for all the fixes everyone"
* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (87 commits)
btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref()
Btrfs: Check if kobject is initialized before put
lib: export symbol kobject_move()
Btrfs: sysfs: add support to show replacing target in the sysfs
Btrfs: free the stale device
Btrfs: use received_uuid of parent during send
Btrfs: fix use-after-free in btrfs_replay_log
btrfs: wait for delayed iputs on no space
btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup.
btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.
btrfs: ulist: Add ulist_del() function.
btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.
btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.
btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.
btrfs: qgroup: Switch rescan to new mechanism.
btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents().
btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots().
btrfs: qgroup: Add new function to record old_roots.
btrfs: qgroup: Record possible quota-related extent for qgroup.
btrfs: qgroup: Add function qgroup_update_counters().
...
Pull more block layer patches from Jens Axboe:
"A few later arrivers that I didn't fold into the first pull request,
so we had a chance to run some testing. This contains:
- NVMe:
- Set of fixes from Keith
- 4.4 and earlier gcc build fix from Andrew
- small set of xen-blk{back,front} fixes from Bob Liu.
- warnings fix for bogus inline statement in I_BDEV() from Geert.
- error code fixup for SG_IO ioctl from Paolo Bonzini"
* 'for-linus' of git://git.kernel.dk/linux-block:
drivers/block/nvme-core.c: fix build with gcc-4.4.4
bdi: Remove "inline" keyword from exported I_BDEV() implementation
block: fix bogus EFAULT error from SG_IO ioctl
NVMe: Fix filesystem deadlock on removal
NVMe: Failed controller initialization fixes
NVMe: Unify controller probe and resume
NVMe: Don't use fake status on cancelled command
NVMe: Fix device cleanup on initialization failure
drivers: xen-blkfront: only talk_to_blkback() when in XenbusStateInitialising
xen/block: add multi-page ring support
driver: xen-blkfront: move talk_to_blkback to a more suitable place
drivers: xen-blkback: delay pending_req allocation to connect_ring
If devpts failed to initialize, it would store an ERR_PTR in the global
devpts_mnt. A subsequent open of /dev/ptmx would call devpts_new_index,
which would dereference devpts_mnt and crash.
Avoid storing invalid values in devpts_mnt; leave it NULL instead. Make
both devpts_new_index and devpts_pty_new fail gracefully with ENODEV in
that case, which then becomes the return value to the userspace open call
on /dev/ptmx.
[akpm@linux-foundation.org: remove unneeded static]
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since patch described below, from v2.6.15-rc1, seq_open() could use a
struct seq_file already allocated by the caller if the pointer to the
structure is stored in file->private_data before calling the function.
Commit 1abe77b0fc
Author: Al Viro <viro@zeniv.linux.org.uk>
Date: Mon Nov 7 17:15:34 2005 -0500
[PATCH] allow callers of seq_open do allocation themselves
Allow caller of seq_open() to kmalloc() seq_file + whatever else they
want and set ->private_data to it. seq_open() will then abstain from
doing allocation itself.
As there's no more use for such feature, as it could be easily replaced by
calls to seq_open_private() (see commit 39699037a5 ("[FS] seq_file:
Introduce the seq_open_private()")) and seq_release_private() (see
v2.6.0-test3), support for this uncommon feature can be removed from
seq_open().
Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A patchset to remove support for passing pre-allocated struct seq_file to
seq_open(). Such feature is undocumented and prone to error.
In particular, if seq_release() is used in release handler, it will
kfree() a pointer which was not allocated by seq_open().
So this patchset drops support for pre-allocated struct seq_file: it's
only of use in proc_namespace.c and can be easily replaced by using
seq_open_private()/seq_release_private().
Additionally, it documents the use of file->private_data to hold pointer
to struct seq_file by seq_open().
This patch (of 3):
Since patch described below, from v2.6.15-rc1, seq_open() could use a
struct seq_file already allocated by the caller if the pointer to the
structure is stored in file->private_data before calling the function.
Commit 1abe77b0fc
Author: Al Viro <viro@zeniv.linux.org.uk>
Date: Mon Nov 7 17:15:34 2005 -0500
[PATCH] allow callers of seq_open do allocation themselves
Allow caller of seq_open() to kmalloc() seq_file + whatever else they
want and set ->private_data to it. seq_open() will then abstain from
doing allocation itself.
Such behavior is only used by mounts_open_common().
In order to drop support for such uncommon feature, proc_mounts is
converted to use seq_open_private(), which take care of allocating the
proc_mounts structure, making it available through ->private in struct
seq_file.
Conversely, proc_mounts is converted to use seq_release_private(), in
order to release the private structure allocated by seq_open_private().
Then, ->private is used directly instead of proc_mounts() macro to access
to the proc_mounts structure.
Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 4f764e5153 ("Btrfs: remove deleted xattrs on fsync log
replay"), we can end up in a situation where during log replay we end up
deleting xattrs that were never deleted when their file was last fsynced.
This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
set, the xattr was added in a past transaction and the leaf where the
xattr is located was not updated (COWed or created) in the current
transaction. In this scenario the xattr item never ends up in the log
tree and therefore at log replay time, which makes the replay code delete
the xattr from the fs/subvol tree as it thinks that xattr was deleted
prior to the last fsync.
Fix this by always logging all xattrs, which is the simplest and most
reliable way to detect deleted xattrs and replay the deletes at log replay
time.
This issue is reproducible with the following test case for fstests:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
trap "_cleanup; exit \$status" 0 1 2 3 15
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
. ./common/attr
# real QA test starts here
# We create a lot of xattrs for a single file. Only btrfs and xfs are currently
# able to store such a large mount of xattrs per file, other filesystems such
# as ext3/4 and f2fs for example, fail with ENOSPC even if we attempt to add
# less than 1000 xattrs with very small values.
_supported_fs btrfs xfs
_supported_os Linux
_need_to_be_root
_require_scratch
_require_dm_flakey
_require_attrs
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create the test file with some initial data and make sure everything is
# durably persisted.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io
sync
# Add many small xattrs to our file.
# We create such a large amount because it's needed to trigger the issue found
# in btrfs - we need to have an amount that causes the fs to have at least 3
# btree leafs with xattrs stored in them, and it must work on any leaf size
# (maximum leaf/node size is 64Kb).
num_xattrs=2000
for ((i = 1; i <= $num_xattrs; i++)); do
name="user.attr_$(printf "%04d" $i)"
$SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo
done
# Sync the filesystem to force a commit of the current btrfs transaction, this
# is a necessary condition to trigger the bug on btrfs.
sync
# Now update our file's data and fsync the file.
# After a successful fsync, if the fsync log/journal is replayed we expect to
# see all the xattrs we added before with the same values (and the updated file
# data of course). Btrfs used to delete some of these xattrs when it replayed
# its fsync log/journal.
$XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount. This makes the fs replay its fsync log.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
echo "File content after crash and log replay:"
od -t x1 $SCRATCH_MNT/foo
echo "File xattrs after crash and log replay:"
for ((i = 1; i <= $num_xattrs; i++)); do
name="user.attr_$(printf "%04d" $i)"
echo -n "$name="
$GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo
echo
done
status=0
exit
The golden output expects all xattrs to be available, and with the correct
values, after the fsync log is replayed.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we do an append write to a file (which increases its inode's i_size)
that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
and the previous transaction added a new hard link to the file, which sets
the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
the file, the inode's new i_size isn't logged. This has the consequence
that after the fsync log is replayed, the file size remains what it was
before the append write operation, which means users/applications will
not be able to read the data that was successsfully fsync'ed before.
This happens because neither the inode item nor the delayed inode get
their i_size updated when the append write is made - doing so would
require starting a transaction in the buffered write path, something that
we do not do intentionally for performance reasons.
Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
set the inode is logged with its current i_size (log the in-memory inode
into the log tree).
This issue is not a recent regression and is easy to reproduce with the
following test case for fstests:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
trap "_cleanup; exit \$status" 0 1 2 3 15
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_supported_fs generic
_supported_os Linux
_need_to_be_root
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
_crash_and_mount()
{
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount. This makes the fs replay its fsync log.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
}
rm -f $seqres.full
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create the test file with some initial data and then fsync it.
# The fsync here is only needed to trigger the issue in btrfs, as it causes the
# the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
sync
# Add a hard link to our file.
# On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
# which is a necessary condition to trigger the issue.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Sync the filesystem to force a commit of the current btrfs transaction, this
# is a necessary condition to trigger the bug on btrfs.
sync
# Now append more data to our file, increasing its size, and fsync the file.
# In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
# write path did not update the inode item in the btree nor the delayed inode
# item (in memory struture) in the current transaction (created by the fsync
# handler), the fsync did not record the inode's new i_size in the fsync
# log/journal. This made the data unavailable after the fsync log/journal is
# replayed.
$XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
echo "File content after fsync and before crash:"
od -t x1 $SCRATCH_MNT/foo
_crash_and_mount
echo "File content after crash and log replay:"
od -t x1 $SCRATCH_MNT/foo
status=0
exit
The expected file output before and after the crash/power failure expects the
appended data to be available, which is:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0200000
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
we could have a concurrent call to btrfs_return_ino() that adds a new
entry to the root's free space cache of pinned inodes. This concurrent
call does not acquire the fs_info->commit_root_sem before adding a new
entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
because the caching kthread calls btrfs_unpin_free_ino() after setting
the caching state to BTRFS_CACHE_FINISHED and therefore races with
the task calling btrfs_return_ino(), which is adding a new entry, while
the former (caching kthread) is navigating the cache's rbtree, removing
and freeing nodes from the cache's rbtree without acquiring the spinlock
that protects the rbtree.
This race resulted in memory corruption due to double free of struct
btrfs_free_space objects because both tasks can end up doing freeing the
same objects. Note that adding a new entry can result in merging it with
other entries in the cache, in which case those entries are freed.
This is particularly important as btrfs_free_space structures are also
used for the block group free space caches.
This memory corruption can be detected by a debugging kernel, which
reports it with the following trace:
[132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
[132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
[132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[132408.505075] ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
[132408.505075] ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
[132408.505075] ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
[132408.505075] Call Trace:
[132408.505075] [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
[132408.505075] [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
[132408.505075] [<ffffffff81154e1e>] __slab_error.isra.28+0x25/0x36
[132408.505075] [<ffffffff81155733>] __cache_free+0xe2/0x4b6
[132408.505075] [<ffffffffa054a95a>] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
[132408.505075] [<ffffffffa0505b5f>] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075] [<ffffffff810f3b30>] ? time_hardirqs_off+0x15/0x28
[132408.505075] [<ffffffff81084d42>] ? trace_hardirqs_off+0xd/0xf
[132408.505075] [<ffffffff811563a1>] ? kfree+0xb6/0x14e
[132408.505075] [<ffffffff811563d0>] kfree+0xe5/0x14e
[132408.505075] [<ffffffffa0505b5f>] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075] [<ffffffffa0505e08>] caching_kthread+0x29e/0x2d9 [btrfs]
[132408.505075] [<ffffffffa0505b6a>] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
[132408.505075] [<ffffffff8106698f>] kthread+0xef/0xf7
[132408.505075] [<ffffffff810f3b08>] ? time_hardirqs_on+0x15/0x28
[132408.505075] [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075] [<ffffffff814653d2>] ret_from_fork+0x42/0x70
[132408.505075] [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
[132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
[132409.503355] ------------[ cut here ]------------
[132409.504241] kernel BUG at mm/slab.c:2571!
Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
that protects the rbtree while doing the searches and removing entries.
Fixes: 1c70d8fb4d ("Btrfs: fix inode caching vs tree log")
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The free space entries are allocated using kmem_cache_zalloc(),
through __btrfs_add_free_space(), therefore we should use
kmem_cache_free() and not kfree() to avoid any confusion and
any potential problem. Looking at the kfree() definition at
mm/slab.c it has the following comment:
/*
* (...)
*
* Don't free memory not originally allocated by kmalloc()
* or you will run into trouble.
*/
So better be safe and use kmem_cache_free().
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Although it is a rare case, we'd better free previous allocated
memory on error.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is introduced by:
c404e0dc2c
Btrfs: fix use-after-free in the finishing procedure of the device replace
But seems no relationship with that bug, this patch revirt these
code block for cleanup.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently, we can only set a limitation on a qgroup, but we
can not clear it.
This patch provide a choice to user to clear a limitation on
qgroup by passing a value of CLEAR_VALUE(-1) to kernel.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
In a dfs setup where the client transitions from a server which supports
posix paths to a server which doesn't support posix paths, the flag
CIFS_MOUNT_POSIX_PATHS is not reset. This leads to the wrong directory
separator being used causing smb commands to fail.
Consider the following case where a dfs share on a samba server points
to a share on windows smb server.
# mount -t cifs -o .. //vm140-31/dfsroot/testwin/
# ls -l /mnt; touch /mnt/a
total 0
touch: cannot touch ‘/mnt/a’: No such file or directory
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <steve.french@primarydata.com>
Pull libnvdimm subsystem from Dan Williams:
"The libnvdimm sub-system introduces, in addition to the
libnvdimm-core, 4 drivers / enabling modules:
NFIT:
Instantiates an "nvdimm bus" with the core and registers memory
devices (NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware
Interface table).
After registering NVDIMMs the NFIT driver then registers "region"
devices. A libnvdimm-region defines an access mode and the
boundaries of persistent memory media. A region may span multiple
NVDIMMs that are interleaved by the hardware memory controller. In
turn, a libnvdimm-region can be carved into a "namespace" device and
bound to the PMEM or BLK driver which will attach a Linux block
device (disk) interface to the memory.
PMEM:
Initially merged in v4.1 this driver for contiguous spans of
persistent memory address ranges is re-worked to drive
PMEM-namespaces emitted by the libnvdimm-core.
In this update the PMEM driver, on x86, gains the ability to assert
that writes to persistent memory have been flushed all the way
through the caches and buffers in the platform to persistent media.
See memcpy_to_pmem() and wmb_pmem().
BLK:
This new driver enables access to persistent memory media through
"Block Data Windows" as defined by the NFIT. The primary difference
of this driver to PMEM is that only a small window of persistent
memory is mapped into system address space at any given point in
time.
Per-NVDIMM windows are reprogrammed at run time, per-I/O, to access
different portions of the media. BLK-mode, by definition, does not
support DAX.
BTT:
This is a library, optionally consumed by either PMEM or BLK, that
converts a byte-accessible namespace into a disk with atomic sector
update semantics (prevents sector tearing on crash or power loss).
The sinister aspect of sector tearing is that most applications do
not know they have a atomic sector dependency. At least today's
disk's rarely ever tear sectors and if they do one almost certainly
gets a CRC error on access. NVDIMMs will always tear and always
silently. Until an application is audited to be robust in the
presence of sector-tearing the usage of BTT is recommended.
Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
Wysocki, and Bob Moore"
* tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm: (33 commits)
arch, x86: pmem api for ensuring durability of persistent memory updates
libnvdimm: Add sysfs numa_node to NVDIMM devices
libnvdimm: Set numa_node to NVDIMM devices
acpi: Add acpi_map_pxm_to_online_node()
libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only
pmem: flag pmem block devices as non-rotational
libnvdimm: enable iostat
pmem: make_request cleanups
libnvdimm, pmem: fix up max_hw_sectors
libnvdimm, blk: add support for blk integrity
libnvdimm, btt: add support for blk integrity
fs/block_dev.c: skip rw_page if bdev has integrity
libnvdimm: Non-Volatile Devices
tools/testing/nvdimm: libnvdimm unit test infrastructure
libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
nd_btt: atomic sector updates
libnvdimm: infrastructure for btt devices
libnvdimm: write blk label set
libnvdimm: write pmem label set
libnvdimm: blk labels and namespace instantiation
...
The only caller that cares about its return value can just
as easily pick it from nd->root_seq itself. We used to just
calculate it and return to caller, but these days we are
storing it in nd->root_seq in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It is possible to have an active open with one mode, and a delegation
for the same file with a different mode.
In particular, a WR_ONLY open and an RD_ONLY delegation.
This happens if a WR_ONLY open is followed by a RD_ONLY open which
provides a delegation, but is then close.
When returning the delegation, we currently try to claim opens for
every open type (n_rdwr, n_rdonly, n_wronly). As there is no harm
in claiming an open for a mode that we already have, this is often
simplest.
However if the delegation only provides a subset of the modes that we
currently have open, this will produce an error from the server.
So when claiming open modes prior to returning a delegation, skip the
open request if the mode is not covered by the delegation - the open_stateid
must already cover that mode, so there is nothing to do.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Send negotiate contexts when SMB3.11 dialect is negotiated
(ie the preauth and the encryption contexts) and
Initialize SMB3.11 preauth negotiate context salt to random bytes
Followon patch will update session setup and tree connect
Signed-off-by: Steve French <steve.french@primarydata.com>
set integrity increases reliability of files stored on SMB3 servers.
Add ioctl to allow setting this on files on SMB3 and later mounts.
Signed-off-by: Steve French <steve.french@primarydata.com>
Getting fantastic copy performance with cp --reflink over SMB3.11
using the new FSCTL_DUPLICATE_EXTENTS.
This FSCTL was added in the SMB3.11 dialect (testing was
against REFS file system) so have put it as a 3.11 protocol
specific operation ("vers=3.1.1" on the mount). Tested at
the SMB3 plugfest in Redmond.
It depends on the new FS Attribute (BLOCK_REFCOUNTING) which
is used to advertise support for the ability to do this ioctl
(if you can support multiple files pointing to the same block
than this refcounting ability or equivalent is needed to
support the new reflink-like duplicate extent SMB3 ioctl.
Signed-off-by: Steve French <steve.french@primarydata.com>
Pull UML updates from Richard Weinberger:
- remove hppfs ("HonePot ProcFS")
- initial support for musl libc
- uaccess cleanup
- random cleanups and bug fixes all over the place
* 'for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (21 commits)
um: Don't pollute kernel namespace with uapi
um: Include sys/types.h for makedev(), major(), minor()
um: Do not use stdin and stdout identifiers for struct members
um: Do not use __ptr_t type for stack_t's .ss pointer
um: Fix mconsole dependency
um: Handle tracehook_report_syscall_entry() result
um: Remove copy&paste code from init.h
um: Stop abusing __KERNEL__
um: Catch unprotected user memory access
um: Fix warning in setup_signal_stack_si()
um: Rework uaccess code
um: Add uaccess.h to ldt.c
um: Add uaccess.h to syscalls_64.c
um: Add asm/elf.h to vma.c
um: Cleanup mem_32/64.c headers
um: Remove hppfs
um: Move syscall() declaration into os.h
um: kernel: ksyms: Export symbol syscall() for fixing modpost issue
um/os-Linux: Use char[] for syscall_stub declarations
um: Use char[] for linker script address declarations
...
Most people think of SMB 3.1.1 as SMB version 3.11 so add synonym
for "vers=3.1.1" of "vers=3.11" on mount.
Also make sure that unlike SMB3.0 and 3.02 we don't send
validate negotiate on mount (it is handled by negotiate contexts) -
add list of SMB3.11 specific functions (distinct from 3.0 dialect).
Signed-off-by: Steve French <steve.french@primarydata.com>w
Add new structures and defines for SMB3.11 negotiate, session setup and tcon
See MS-SMB2-diff.pdf section 2.2.3 for additional protocol documentation.
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
Parses and recognizes "vers=3.1.1" on cifs mount and allows sending
0x0311 as a new CIFS/SMB3 dialect. Subsequent patches will add
the new negotiate contexts and updated session setup
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
[MS-SMB] 2.2.4.5.2.1 states:
"ChallengeLength (1 byte): When the CAP_EXTENDED_SECURITY bit is set,
the server MUST set this value to zero and clients MUST ignore this
value."
Signed-off-by: Noel Power <noel.power@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Pull security subsystem updates from James Morris:
"The main change in this kernel is Casey's generalized LSM stacking
work, which removes the hard-coding of Capabilities and Yama stacking,
allowing multiple arbitrary "small" LSMs to be stacked with a default
monolithic module (e.g. SELinux, Smack, AppArmor).
See
https://lwn.net/Articles/636056/
This will allow smaller, simpler LSMs to be incorporated into the
mainline kernel and arbitrarily stacked by users. Also, this is a
useful cleanup of the LSM code in its own right"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
tpm, tpm_crb: fix le64_to_cpu conversions in crb_acpi_add()
vTPM: set virtual device before passing to ibmvtpm_reset_crq
tpm_ibmvtpm: remove unneccessary message level.
ima: update builtin policies
ima: extend "mask" policy matching support
ima: add support for new "euid" policy condition
ima: fix ima_show_template_data_ascii()
Smack: freeing an error pointer in smk_write_revoke_subj()
selinux: fix setting of security labels on NFS
selinux: Remove unused permission definitions
selinux: enable genfscon labeling for sysfs and pstore files
selinux: enable per-file labeling for debugfs files.
selinux: update netlink socket classes
signals: don't abuse __flush_signals() in selinux_bprm_committed_creds()
selinux: Print 'sclass' as string when unrecognized netlink message occurs
Smack: allow multiple labels in onlycap
Smack: fix seq operations in smackfs
ima: pass iint to ima_add_violation()
ima: wrap event related data to the new ima_event_data structure
integrity: add validity checks for 'path' parameter
...
With gcc 3.4.6/4.1.2/4.2.4 (not with 4.4.7/4.6.4/4.8.4):
CC fs/block_dev.o
include/linux/fs.h:804: warning: ‘I_BDEV’ declared inline after being called
include/linux/fs.h:804: warning: previous declaration of ‘I_BDEV’ was here
Commit a212b105b0 ("bdi: make inode_to_bdi() inline") added a
caller of I_BDEV() in a header file, exposing the bogus "inline" on the
exported implementation.
Drop the "inline" keyword to fix this.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull nfsd updates from Bruce Fields:
"A relatively quiet cycle, with a mix of cleanup and smaller bugfixes"
* 'for-4.2' of git://linux-nfs.org/~bfields/linux: (24 commits)
sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key()
nfsd: wrap too long lines in nfsd4_encode_read
nfsd: fput rd_file from XDR encode context
nfsd: take struct file setup fully into nfs4_preprocess_stateid_op
nfsd: refactor nfs4_preprocess_stateid_op
nfsd: clean up raparams handling
nfsd: use swap() in sort_pacl_range()
rpcrdma: Merge svcrdma and xprtrdma modules into one
svcrdma: Add a separate "max data segs macro for svcrdma
svcrdma: Replace GFP_KERNEL in a loop with GFP_NOFAIL
svcrdma: Keep rpcrdma_msg fields in network byte-order
svcrdma: Fix byte-swapping in svc_rdma_sendto.c
nfsd: Update callback sequnce id only CB_SEQUENCE success
nfsd: Reset cb_status in nfsd4_cb_prepare() at retrying
svcrdma: Remove svc_rdma_xdr_decode_deferred_req()
SUNRPC: Move EXPORT_SYMBOL for svc_process
uapi/nfs: Add NFSv4.1 ACL definitions
nfsd: Remove dead declarations
nfsd: work around a gcc-5.1 warning
nfsd: Checking for acl support does not require fetching any acls
...
Pull GFS2 updates from Bob Peterson:
"Here are the patches we've accumulated for GFS2 for the current
upstream merge window. We have a good mixture this time. Here are
some of the features:
- Fix a problem with RO mounts writing to the journal.
- Further improvements to quotas on GFS2.
- Added support for rename2 and RENAME_EXCHANGE on GFS2.
- Increase performance by making glock lru_list less of a bottleneck.
- Increase performance by avoiding unnecessary buffer_head releases.
- Increase performance by using average glock round trip time from all CPUs.
- Fixes for some compiler warnings and minor white space issues.
- Other misc bug fixes"
* tag 'gfs2-merge-window' of git://git.kernel.org:/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Don't brelse rgrp buffer_heads every allocation
GFS2: Don't add all glocks to the lru
gfs2: Don't support fallocate on jdata files
gfs2: s64 cast for negative quota value
gfs2: limit quota log messages
gfs2: fix quota updates on block boundaries
gfs2: fix shadow warning in gfs2_rbm_find()
gfs2: kerneldoc warning fixes
gfs2: convert simple_str to kstr
GFS2: make sure S_NOSEC flag isn't overwritten
GFS2: add support for rename2 and RENAME_EXCHANGE
gfs2: handle NULL rgd in set_rgrp_preferences
GFS2: inode.c: indent with TABs, not spaces
GFS2: mark the journal idle to fix ro mounts
GFS2: Average in only non-zero round-trip times for congestion stats
GFS2: Use average srttb value in congestion calculations
This reverts commit 2143c1965a.
This commit seems to be the cause of the following jbd2 assertion
failure:
------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1325!
invalid opcode: 0000 [#1] SMP
Modules linked in: bnep bluetooth fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 ...
CPU: 7 PID: 5509 Comm: gcc Not tainted 4.1.0-10944-g2a298679b411 #1
Hardware name: /DH87RL, BIOS RLH8710H.86A.0327.2014.0924.1645 09/24/2014
task: ffff8803bf866040 ti: ffff880308528000 task.ti: ffff880308528000
RIP: jbd2_journal_dirty_metadata+0x237/0x290
Call Trace:
__ext4_handle_dirty_metadata+0x43/0x1f0
ext4_handle_dirty_dirent_node+0xde/0x160
? jbd2_journal_get_write_access+0x36/0x50
ext4_delete_entry+0x112/0x160
? __ext4_journal_start_sb+0x52/0xb0
ext4_unlink+0xfa/0x260
vfs_unlink+0xec/0x190
do_unlinkat+0x24a/0x270
SyS_unlink+0x11/0x20
entry_SYSCALL_64_fastpath+0x12/0x6a
---[ end trace ae033ebde8d080b4 ]---
which is not easily reproducible (I've seen it just once, and then Ted
was able to reproduce it once). Revert it while Ted and Jan try to
figure out what is wrong.
Cc: Jan Kara <jack@suse.cz>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it so, by checking the return value for NFS4ERR_MOTSUPP and
caching the information as a server capability.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
According to the spec, the server is only returning the status,
which we decode in the op header.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull cgroup updates from Tejun Heo:
- threadgroup_lock got reorganized so that its users can pick the
actual locking mechanism to use. Its only user - cgroups - is
updated to use a percpu_rwsem instead of per-process rwsem.
This makes things a bit lighter on hot paths and allows cgroups to
perform and fail multi-task (a process) migrations atomically.
Multi-task migrations are used in several places including the
unified hierarchy.
- Delegation rule and documentation added to unified hierarchy. This
will likely be the last interface update from the cgroup core side
for unified hierarchy before lifting the devel mask.
- Some groundwork for the pids controller which is scheduled to be
merged in the coming devel cycle.
* 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: add delegation section to unified hierarchy documentation
cgroup: require write perm on common ancestor when moving processes on the default hierarchy
cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
kernfs: make kernfs_get_inode() public
MAINTAINERS: add a cgroup core co-maintainer
cgroup: fix uninitialised iterator in for_each_subsys_which
cgroup: replace explicit ss_mask checking with for_each_subsys_which
cgroup: use bitmask to filter for_each_subsys
cgroup: add seq_file forward declaration for struct cftype
cgroup: simplify threadgroup locking
sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
sched, cgroup: reorganize threadgroup locking
cgroup: switch to unsigned long for bitmasks
cgroup: reorganize include/linux/cgroup.h
cgroup: separate out include/linux/cgroup-defs.h
cgroup: fix some comment typos
Pull driver core updates from Greg KH:
"Here is the driver core / firmware changes for 4.2-rc1.
A number of small changes all over the place in the driver core, and
in the firmware subsystem. Nothing really major, full details in the
shortlog. Some of it is a bit of churn, given that the platform
driver probing changes was found to not work well, so they were
reverted.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
Revert "base/platform: Only insert MEM and IO resources"
Revert "base/platform: Continue on insert_resource() error"
Revert "of/platform: Use platform_device interface"
Revert "base/platform: Remove code duplication"
firmware: add missing kfree for work on async call
fs: sysfs: don't pass count == 0 to bin file readers
base:dd - Fix for typo in comment to function driver_deferred_probe_trigger().
base/platform: Remove code duplication
of/platform: Use platform_device interface
base/platform: Continue on insert_resource() error
base/platform: Only insert MEM and IO resources
firmware: use const for remaining firmware names
firmware: fix possible use after free on name on asynchronous request
firmware: check for file truncation on direct firmware loading
firmware: fix __getname() missing failure check
drivers: of/base: move of_init to driver_init
drivers/base: cacheinfo: fix annoying typo when DT nodes are absent
sysfs: disambiguate between "error code" and "failure" in comments
driver-core: fix build for !CONFIG_MODULES
driver-core: make __device_attach() static
...
This patch ensures that we record the value of 'ffl_flags' from
the layout, and then checks for the presence of the
FF_FLAGS_NO_LAYOUTCOMMIT flag before deciding whether or not to
call pnfs_set_layoutcommit().
The effect is that servers now can decide whether or not they want
the client to call layoutcommit before returning a writeable layout.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* layoutstats:
pnfs/flexfiles: protect ktime manipulation with mirror lock
nfs: provide pnfs_report_layoutstat when NFS42 is disabled
pnfs/flexfiles: report layoutstat regularly
nfs42: serialize LAYOUTSTATS calls of the same file
pnfs/flexfiles: encode LAYOUTSTATS flexfiles specific data
pnfs/flexfiles: add ff_layout_prepare_layoutstats
pNFS/flexfiles: track when layout is first used
pNFS/flexfiles: add layoutstats tracking
pNFS/flexfiles: Remove unused struct members user_name, group_name
pnfs: add pnfs_report_layoutstat helper function
pNFS: fill in nfs42_layoutstat_ops
NFSv.2/pnfs Add a LAYOUTSTATS rpc function
It looks as if xchg() and cmpxchg() are not available for 64-bit integers on sparc32:
> New breakage seen in linux-next today:
>
> ERROR: "__xchg_called_with_bad_pointer" [fs/nfs/flexfilelayout/nfs_layout_flexfiles.ko] undefined!
> ERROR: "__cmpxchg_called_with_bad_pointer" [fs/nfs/flexfilelayout/nfs_layout_flexfiles.ko] undefined!
> make[2]: *** [__modpost] Error 1
> make[1]: *** [modules] Error 2
Given that mirror ktime manipulation is already under mirror->lock, let's make use of the fact.
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
kbuild test robot reported:
fs/built-in.o: In function `pnfs_report_layoutstat':
>> (.text+0x151a1c): undefined reference to `nfs42_proc_layoutstats_generic'
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Merge second patchbomb from Andrew Morton:
- most of the rest of MM
- lots of misc things
- procfs updates
- printk feature work
- updates to get_maintainer, MAINTAINERS, checkpatch
- lib/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
exit,stats: /* obey this comment */
coredump: add __printf attribute to cn_*printf functions
coredump: use from_kuid/kgid when formatting corename
fs/reiserfs: remove unneeded cast
NILFS2: support NFSv2 export
fs/befs/btree.c: remove unneeded initializations
fs/minix: remove unneeded cast
init/do_mounts.c: add create_dev() failure log
kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
fs/efs: femove unneeded cast
checkpatch: emit "NOTE: <types>" message only once after multiple files
checkpatch: emit an error when there's a diff in a changelog
checkpatch: validate MODULE_LICENSE content
checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
checkpatch: fix processing of MEMSET issues
checkpatch: suggest using ether_addr_equal*()
checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
checkpatch: remove local from codespell path
checkpatch: add --showfile to allow input via pipe to show filenames
...