The user_ns is moved from nsproxy to user_struct, so that a struct
cred by itself is sufficient to determine access (which it otherwise
would not be). Corresponding ecryptfs fixes (by David Howells) are
here as well.
Fix refcounting. The following rules now apply:
1. The task pins the user struct.
2. The user struct pins its user namespace.
3. The user namespace pins the struct user which created it.
User namespaces are cloned during copy_creds(). Unsharing a new user_ns
is no longer possible. (We could re-add that, but it'll cause code
duplication and doesn't seem useful if PAM doesn't need to clone user
namespaces).
When a user namespace is created, its first user (uid 0) gets empty
keyrings and a clean group_info.
This incorporates a previous patch by David Howells. Here
is his original patch description:
>I suggest adding the attached incremental patch. It makes the following
>changes:
>
> (1) Provides a current_user_ns() macro to wrap accesses to current's user
> namespace.
>
> (2) Fixes eCryptFS.
>
> (3) Renames create_new_userns() to create_user_ns() to be more consistent
> with the other associated functions and because the 'new' in the name is
> superfluous.
>
> (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
> beginning of do_fork() so that they're done prior to making any attempts
> at allocation.
>
> (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
> to fill in rather than have it return the new root user. I don't imagine
> the new root user being used for anything other than filling in a cred
> struct.
>
> This also permits me to get rid of a get_uid() and a free_uid(), as the
> reference the creds were holding on the old user_struct can just be
> transferred to the new namespace's creator pointer.
>
> (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
> preparation rather than doing it in copy_creds().
>
>David
>Signed-off-by: David Howells <dhowells@redhat.com>
Changelog:
Oct 20: integrate dhowells comments
1. leave thread_keyring alone
2. use current_user_ns() in set_user()
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Since commit c98451bd, the loop in nlm_lookup_host() unconditionally
compares the host's h_srcaddr field to the incoming source address.
For client-side nlm_host entries, both are always AF_UNSPEC, so this
check is unnecessary.
Since commit 781b61a6, which added support for AF_INET6 addresses to
nlm_cmp_addr(), nlm_cmp_addr() now returns FALSE for AF_UNSPEC
addresses, which causes nlm_lookup_host() to create a fresh nlm_host
entry every time it is called on the client.
These extra entries will eventually expire once the server is
unmounted, so the impact of this regression, introduced with lockd
IPv6 support in 2.6.28, should be minor.
We could fix this by adding an arm in nlm_cmp_addr() for AF_UNSPEC
addresses, but really, nlm_lookup_host() shouldn't be matching on the
srcaddr field for client-side nlm_host lookups.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
If nfsd was shut down before the grace period ended, we could end up
with a freed object still on grace_list. Thanks to Jeff Moyer for
reporting the resulting list corruption warnings.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
In ext4_mb_init_group(), if the filesystem block size is less than
PAGE_SIZE/2, the code tries to grab alloc_sem for multiple block
groups in a loop. We need to allow for this by using
down_write_nested() and passing in the loop index as a lock subclass
number. This works because no other code path needs to take multiple
alloc_sem's. Note that lockdep will fail for filesystem blocksize
smaller than to PAGE_SIZE/16k. (e.g., a 1k filesystem blocksize with
a 32k page size, or a 2k filesystem blocksize with a 64k blocksize,
etc.)
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When we generate buddy cache (especially during resize) we need to
make sure we don't use the blocks freed but not yet comitted. This
makes sure we have the right value of free blocks count in the group
info and also in the bitmap. This also ensures the ordered mode
consistency
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
To avoid memory allocation failure during bulk-read, pre-allocate
a bulk-read buffer, so that if there is only one bulk-reader at
a time, it would just use the pre-allocated buffer and would not
do any memory allocation. However, if there are more than 1 bulk-
reader, then only one reader would use the pre-allocated buffer,
while the other reader would allocate the buffer for itself.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Bulk-read allocates 128KiB or more using kmalloc. The allocation
starts failing often when the memory gets fragmented. UBIFS still
works fine in this case, because it falls-back to standard
(non-optimized) read method, though. This patch teaches bulk-read
to allocate exactly the amount of memory it needs, instead of
allocating 128KiB every time.
This patch is also a preparation to the further fix where we'll
have a pre-allocated bulk-read buffer as well. For example, now
the @bu object is prepared in 'ubifs_bulk_read()', so we could
path either pre-allocated or allocated information to
'ubifs_do_bulk_read()' later. Or teaching 'ubifs_do_bulk_read()'
not to allocate 'bu->buf' if it is already there.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Bulk-read allocates a lot of memory with 'kmalloc()', and when it
is/gets fragmented 'kmalloc()' fails with a scarry warning. But
because bulk-read is just an optimization, UBIFS keeps working fine.
Supress the warning by passing __GFP_NOWARN option to 'kmalloc()'.
This patch also introduces a macro for the magic 128KiB constant.
This is just neater.
Note, this is not really fixes the problem we had, but just hides
the warnings. The further patches fix the problem.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] Do not attempt to close invalidated file handles
[CIFS] fix check for dead tcon in smb_init
If a connection with open file handles has gone down
and come back up and reconnected without reopening
the file handle yet, do not attempt to send an SMB close
request for this handle in cifs_close. We were
checking for the connection being invalid in cifs_close
but since the connection may have been reconnected
we also need to check whether the file handle
was marked invalid (otherwise we could close the
wrong file handle by accident).
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This the lockdep complaint by having a different mutex to gaurd caching the
block group, so you don't end up with this backwards dependancy. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
The btrfs write_cache_pages call has a flush function so that it submits
the bio it has been building before it waits on any writeback pages.
This adds a check so that flush only happens on writeback pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The log replay produces dirty roots. These dirty roots
should be dropped immediately if the fs is mounted as
ro. Otherwise they can be added to the dirty root list
again when remounting the fs as rw. Thank you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
The btrfs git kernel trees is used to build a standalone tree for
compiling against older kernels. This commit makes the standalone tree
work with 2.6.27
Signed-off-by: Chris Mason <chris.mason@oracle.com>
fs/hostfs/hostfs_user.c defines do_readlink() as non-static, and so does
fs/xfs/linux-2.6/xfs_ioctl.c when CONFIG_XFS_DEBUG=y. So rename
do_readlink() in hostfs to hostfs_do_readlink().
I think it's better if XFS guys will also rename their do_readlink(),
it's not necessary to use such a general name.
Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Cordes is sorry that he rm'ed his swapfiles while they were in use,
he then had no pathname to swapoff. It's a curious little oversight, but
not one worth a lot of hackery. Kudos to Willy Tarreau for turning this
around from a discussion of synthetic pathnames to how to prevent unlink.
Mimic immutable: prohibit unlinking an active swapfile in may_delete()
(and don't worry my little head over the tiny race window).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Willy Tarreau <w@1wt.eu>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Peter Cordes <peter@cordes.ca>
Cc: Bodo Eggert <7eggert@gmx.de>
Cc: David Newall <davidn@davidnewall.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I have received some reports of out-of-memory errors on some older AMD
architectures. These errors are what I would expect to see if
crypt_stat->key were split between two separate pages. eCryptfs should
not assume that any of the memory sent through virt_to_scatterlist() is
all contained in a single page, and so this patch allocates two
scatterlist structs instead of one when processing keys. I have received
confirmation from one person affected by this bug that this patch resolves
the issue for him, and so I am submitting it for inclusion in a future
stable release.
Note that virt_to_scatterlist() runs sg_init_table() on the scatterlist
structs passed to it, so the calls to sg_init_table() in
decrypt_passphrase_encrypted_session_key() are redundant.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Reported-by: Paulo J. S. Silva <pjssilva@ime.usp.br>
Cc: "Leon Woestenberg" <leon.woestenberg@gmail.com>
Cc: Tim Gardner <tim.gardner@canonical.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* open/close_bdev_excl -> open/close_bdev_exclusive
* blkdev_issue_discard takes a GFP mask now
* Fix blkdev_issue_discard usage now that it is enabled
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch fixes what I hope is the last early ENOSPC bug left. I did not know
that pinned extents would merge into one big extent when inserted on to the
pinned extent tree, so I was adding free space to a block group that could
possibly span multiple block groups.
This is a big issue because first that space doesn't exist in that block group,
and second we won't actually use that space because there are a bunch of other
checks to make sure we're allocating within the constraints of the block group.
This patch fixes the problem by adding the btrfs_add_free_space to
btrfs_update_pinned_extents which makes sure we are adding the appropriate
amount of free space to the appropriate block group. Thanks much to Lee Trager
for running my myriad of debug patches to help me track this problem down.
Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
fsync log replay can change the filesystem, so it cannot be delayed until
mount -o rw,remount, and it can't be forgotten entirely. So, this patch
changes btrfs to do with reiserfs, ext3 and xfs do, which is to do the
log replay even when mounted readonly.
On a readonly device if log replay is required, the mount is aborted.
Getting all of this right had required fixing up some of the error
handling in open_ctree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While building large bios in writepages, btrfs may end up waiting
for other page writeback to finish if WB_SYNC_ALL is used.
While it is waiting, the bio it is building has a number of pages with the
writeback bit set and they aren't getting to the disk any time soon. This
lowers the latencies of writeback in general by sending down the bio being
built before waiting for other pages.
The bio submission code tries to limit the total number of async bios in
flight by waiting when we're over a certain number of async bios. But,
the waits are happening while writepages is building bios, and this can easily
lead to stalls and other problems for people calling wait_on_page_writeback.
The current fix is to let the congestion tests take care of waiting.
sync() and others make sure to drain the current async requests to make
sure that everything that was pending when the sync was started really get
to disk. The code would drain pending requests both before and after
submitting a new request.
But, if one of the requests is waiting for page writeback to finish,
the draining waits might block that page writeback. This changes the
draining code to only wait after submitting the bio being processed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent based waiting was using more CPU, and other fixes have helped
with the unplug storm problems.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This was recently changed to check for need_reconnect, but should
actually be a check for a tidStatus of CifsExiting.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
For larger multi-device filesystems, there was logic to limit the
number of devices unplugged to just the page that was sent to our sync_page
function.
But, the code wasn't always unplugging the right device. Since this was
just an optimization, disable it for now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In insert_extents(), when ret==1 and last is not zero, it should
check if the current inserted item is the last item in this batching
inserts. If so, it should just break from loop. If not, 'cur =
insert_list->next' will make no sense because the list is empty now,
and 'op' will point to an unexpectable place.
There are also some trivial fixs in this patch including one comment
typo error and deleting two redundant lines.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Block ext devt conversion missed md_autodetect_dev() call in
rescan_partitions() leaving md autodetect unable to see partitions.
Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Make add_partition() return pointer to the new hd_struct on success
and ERR_PTR() value on failure. This change will be used to fix md
autodetection bug.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
prevent cifs_writepages() from skipping unwritten pages
Fixed parsing of mount options when doing DFS submount
[CIFS] Fix check for tcon seal setting and fix oops on failed mount from earlier patch
[CIFS] Fix build break
cifs: reinstate sharing of tree connections
[CIFS] minor cleanup to cifs_mount
cifs: reinstate sharing of SMB sessions sans races
cifs: disable sharing session and tcon and add new TCP sharing code
[CIFS] clean up server protocol handling
[CIFS] remove unused list, add new cifs sock list to prepare for mount/umount fix
[CIFS] Fix cifs reconnection flags
[CIFS] Can't rely on iov length and base when kernel_recvmsg returns error
Fixes a data corruption under heavy stress in which pages could be left
dirty after all open instances of a inode have been closed.
In order to write contiguous pages whenever possible, cifs_writepages()
asks pagevec_lookup_tag() for more pages than it may write at one time.
Normally, it then resets index just past the last page written before calling
pagevec_lookup_tag() again.
If cifs_writepages() can't write the first page returned, it wasn't resetting
index, and the next call to pagevec_lookup_tag() resulted in skipping all of
the pages it previously returned, even though cifs_writepages() did nothing
with them. This can result in data loss when the file descriptor is about
to be closed.
This patch ensures that index gets set back to the next returned page so
that none get skipped.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Shirish S Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Since these hit the same routines, and are relatively small, it is easier to review
them as one patch.
Fixed incorrect handling of the last option in some cases
Fixed prefixpath handling convert path_consumed into host depended string length (in bytes)
Use non default separator if it is provided in the original mount options
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
For a directory tree:
/mnt/subvolA/subvolB
btrfsctl -s /mnt/subvolA/subvolB /mnt
Will create a directory loop with subvolA under subvolB. This
commit uses the forward refs for each subvol and snapshot to error out
before creating the loop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Subvols and snapshots can now be referenced from any point in the directory
tree. We need to maintain back refs for them so we can find lost
subvols.
Forward refs are added so that we know all of the subvols and
snapshots referenced anywhere in the directory tree of a single subvol. This
can be used to do recursive snapshotting (but they aren't yet) and it is
also used to detect and prevent directory loops when creating new snapshots.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Each subvolume has its own private inode number space, and so we need
to fill in different device numbers for each subvolume to avoid confusing
applications.
This commit puts a struct super_block into struct btrfs_root so it can
call set_anon_super() and get a different device number generated for
each root.
btrfs_rename is changed to prevent renames across subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, all snapshots and subvolumes lived in a single flat directory. This
was awkward and confusing because the single flat directory was only writable
with the ioctls.
This commit changes the ioctls to create subvols and snapshots at any
point in the directory tree. This requires making separate ioctls for
snapshot and subvol creation instead of a combining them into one.
The subvol ioctl does:
btrfsctl -S subvol_name parent_dir
After the ioctl is done subvol_name lives inside parent_dir.
The snapshot ioctl does:
btrfsctl -s path_for_snapshot root_to_snapshot
path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up
into directory and basename components.
root_to_snapshot can be any file or directory in the FS. The snapshot
is taken of the entire root where that file lives.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In my batch delete/update/insert patch I introduced a free space leak. The
extent that we do the original search on in free_extents is never pinned, so we
always update the block saying that it has free space, but the free space never
actually gets added to the free space tree, since op->del will always be 0 and
it's never actually added to the pinned extents tree.
This patch fixes this problem by making sure we call pin_down_bytes on the
pending extent op and set op->del to the return value of pin_down_bytes so
update_block_group is called with the right value. This seems to fix the case
where we were getting ENOSPC when there was plenty of space available.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
set tcon->ses earlier
If the inital tree connect fails, we'll end up calling cifs_put_smb_ses
with a NULL pointer. Fix it by setting the tcon->ses earlier.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When an I/O error occurs during an intermediate commit on a rolling
transaction, xfs_trans_commit() will free the transaction structure
and the related ticket. However, the duplicate transaction that
gets used as the transaction continues still contains a pointer
to the ticket. Hence when the duplicate transaction is cancelled
and freed, we free the ticket a second time.
Add reference counting to the ticket so that we hold an extra
reference to the ticket over the transaction commit. We drop the
extra reference once we have checked that the transaction commit
did not return an error, thus avoiding a double free on commit
error.
Credit to Nick Piggin for tripping over the problem.
SGI-PV: 989741
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>