Commit Graph

34202 Commits

Author SHA1 Message Date
Al Viro
22a8cb8248 new helper: dump_align()
dump_skip to given alignment...
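
As a rough illustration of what skipping to an alignment involves, the sketch below computes how many padding bytes lie between a position and the next aligned boundary. The helper name `padding_to_align` is hypothetical, and this is not the kernel's implementation; it assumes `align` is a nonzero power of two.

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed to advance `pos` to the next multiple
 * of `align`. Assumes `align` is a nonzero power of two. */
static size_t padding_to_align(size_t pos, size_t align)
{
    size_t mod = pos & (align - 1);   /* offset past the last boundary */
    return mod ? align - mod : 0;     /* already aligned: nothing to skip */
}
```

A caller would then skip that many bytes before emitting the next record.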

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:27 -05:00
Al Viro
9b56d54380 dump_skip(): dump_seek() replacement taking coredump_params
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:26 -05:00
Al Viro
2507a4fbd4 make dump_emit() use vfs_write() instead of banging at ->f_op->write directly
... and deal with short writes properly - the output might be to pipe, after
all; as it is, e.g. no-MMU case of elf_fdpic coredump can write a whole lot
more than a page worth of data at one call.
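
The short-write handling described above can be sketched in user space as a loop that keeps writing until everything is out, while tracking a running total against a limit. All names here (`chunk_sink`, `sink_write`, `emit_state`, `emit_all`) are hypothetical stand-ins, not kernel code; the sink deliberately accepts only a few bytes per call to mimic a pipe.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical output sink that accepts at most `max_chunk` bytes per call,
 * mimicking a pipe that returns short writes. */
struct chunk_sink {
    char   buf[64];
    size_t used;
    size_t max_chunk;
};

static ssize_t sink_write(struct chunk_sink *s, const void *data, size_t len)
{
    size_t room = sizeof(s->buf) - s->used;
    size_t n = len < s->max_chunk ? len : s->max_chunk;

    if (n > room)
        n = room;
    if (n == 0)
        return -1;                    /* nothing can be accepted */
    memcpy(s->buf + s->used, data, n);
    s->used += n;
    return (ssize_t)n;
}

/* Hypothetical bookkeeping: bytes written so far and the size limit. */
struct emit_state {
    struct chunk_sink *sink;
    size_t written;
    size_t limit;
};

/* Returns 1 if all `len` bytes were written, 0 otherwise. */
static int emit_all(struct emit_state *st, const void *data, size_t len)
{
    const char *p = data;

    if (st->written > st->limit || len > st->limit - st->written)
        return 0;                     /* would exceed the size limit */
    while (len > 0) {
        ssize_t n = sink_write(st->sink, p, len);
        if (n <= 0)
            return 0;                 /* error or no progress */
        p += n;                       /* resume after the short write */
        len -= (size_t)n;
        st->written += (size_t)n;
    }
    return 1;
}
```

The point of the loop is that a single successful call is not taken to mean the whole buffer went out.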

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:26 -05:00
Al Viro
1ad67015e6 binfmt_elf: count notes towards coredump limit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:25 -05:00
Al Viro
43a5d548eb aout: switch to dump_emit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:25 -05:00
Al Viro
cdc3d5627d switch elf_coredump_extra_notes_write() to dump_emit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:24 -05:00
Al Viro
e6c1baa9b5 convert the rest of binfmt_elf_fdpic to dump_emit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:24 -05:00
Al Viro
13046ece96 binfmt_elf: convert writing actual dump pages to dump_emit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:24 -05:00
Al Viro
aa3e7eaf0a switch elf_core_write_extra_data() to dump_emit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:23 -05:00
Al Viro
506f21c556 switch elf_core_write_extra_phdrs() to dump_emit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:23 -05:00
Al Viro
ecc8c7725e new helper: dump_emit()
dump_write() analog, takes coredump_params instead of file,
keeps track of the amount written in cprm->written and checks for
cprm->limit.  Start using it in binfmt_elf.c...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:22 -05:00
Al Viro
11d100d9a2 coda_revalidate_inode(): switch to passing inode...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:21 -05:00
Al Viro
b61625d245 fold __d_shrink() into its only remaining caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:21 -05:00
Al Viro
eee5cc2702 get rid of s_files and files_lock
The only thing we need it for is alt-sysrq-r (emergency remount r/o)
and these days we can do just as well without going through the
list of files.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:20 -05:00
Al Viro
8b61e74ffc get rid of {lock,unlock}_rcu_walk()
those have become aliases for rcu_read_{lock,unlock}()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:20 -05:00
Al Viro
48a066e72d RCU'd vfsmounts
* RCU-delayed freeing of vfsmounts
* vfsmount_lock replaced with a seqlock (mount_lock)
* sequence number from mount_lock is stored in nameidata->m_seq and
used when we exit RCU mode
* new vfsmount flag - MNT_SYNC_UMOUNT.  Set by umount_tree() when its
caller knows that vfsmount will have no surviving references.
* synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
and doing pending mntput().
* new helper: legitimize_mnt(mnt, seq).  Checks the mount_lock sequence
number against seq, then grabs reference to mnt.  Then it rechecks mount_lock
again to close the race and either returns success or drops the reference it
has acquired.  The subtle point is that in case of MNT_SYNC_UMOUNT we can
simply decrement the refcount and sod off - aforementioned synchronize_rcu()
makes sure that final mntput() won't come until we leave RCU mode.  We need
that, since we don't want to end up with some lazy pathwalk racing with
umount() and stealing the final mntput() from it - caller of umount() may
expect it to return only once the fs is shut down and we don't want to break
that.  In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
full-blown mntput() in case of mount_lock sequence number mismatch happening
just as we'd grabbed the reference, but in those cases we won't be stealing
the final mntput() from anything that would care.
* mntput_no_expire() doesn't lock anything on the fast path now.  Incidentally,
SMP and UP cases are handled the same way - no ifdefs there.
* normal pathname resolution does *not* do any writes to mount_lock.  It does,
of course, bump the refcounts of vfsmount and dentry in the very end, but that's
it.
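
The check, grab, recheck pattern described for legitimize_mnt() can be sketched with C11 atomics. This is a simplified user-space illustration, not the kernel code: `mount_seq_sketch`, `mnt_sketch` and `legitimize_sketch` are hypothetical names, the sequence counter is a plain atomic rather than a seqlock, and the back-out path is a bare decrement (which corresponds only to the MNT_SYNC_UMOUNT case in the description; the other case needs a full mntput()).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins: a global sequence counter bumped by writers and a
 * per-object reference count. */
static atomic_uint mount_seq_sketch;

struct mnt_sketch {
    atomic_int refcount;
};

static bool seq_changed(unsigned seq)
{
    return atomic_load(&mount_seq_sketch) != seq;
}

/* Check, grab, recheck: returns true with a reference held, or false with
 * the reference count left as it was found. */
static bool legitimize_sketch(struct mnt_sketch *m, unsigned seq)
{
    if (seq_changed(seq))
        return false;                      /* stale before we even started */
    atomic_fetch_add(&m->refcount, 1);     /* optimistic grab */
    if (!seq_changed(seq))
        return true;                       /* nothing raced with us */
    atomic_fetch_sub(&m->refcount, 1);     /* lost the race: back out */
    return false;
}
```

The recheck after the grab is what closes the window between sampling the sequence number and taking the reference.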

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:19 -05:00
Al Viro
42c326082d switch shrink_dcache_for_umount() to use of d_walk()
we have too many iterators in fs/dcache.c...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:06 -05:00
Kent Overstreet
6678d83f18 block: Consolidate duplicated bio_trim() implementations
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on,
we should know better than this.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 09:02:31 -07:00
Christoph Lameter
170d800af8 block: Replace __get_cpu_var uses
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x).  This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset.  Thereby address calculations are avoided and less registers
are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non-x86
arches as well in order to optimize per cpu access by, e.g., using a global
register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

	DEFINE_PER_CPU(int, y);
	int *x = &__get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

	DEFINE_PER_CPU(int, y[20]);
	int *x = __get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

	DEFINE_PER_CPU(int, y);
	int x = __get_cpu_var(y);

   Converts to

	int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

	DEFINE_PER_CPU(struct mystruct, y);
	struct mystruct x = __get_cpu_var(y);

   Converts to

	memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y) = x;

   Converts to

	this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y)++;

   Converts to

	this_cpu_inc(y);

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 08:59:58 -07:00
Mikulas Patocka
8077c0d983 bdi: test bdi_init failure
There were two places where return value from bdi_init was not tested.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-08 08:59:44 -07:00
Theodore Ts'o
dd1f723bf5 ext4: use prandom_u32() instead of get_random_bytes()
Many of the uses of get_random_bytes() do not actually need
cryptographically secure random numbers.  Replace those uses with a
call to prandom_u32(), which is faster and which doesn't consume
entropy from the /dev/random driver.
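
To illustrate the kind of generator that is sufficient when cryptographic strength is not needed, here is a minimal xorshift32 sketch. It is a stand-in chosen for brevity and is not the algorithm behind prandom_u32(); the caller-managed `state` is an assumption of this example.

```c
#include <stdint.h>

/* xorshift32 (Marsaglia): a tiny non-cryptographic generator, used here
 * only as a stand-in. The state must be seeded with a nonzero value. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;        /* next call continues from here */
    return x;
}
```

A generator like this is a few shifts and XORs per call and draws nothing from the entropy pool, which is the trade-off the message describes.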

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-11-08 00:14:53 -05:00
Chao Yu
1d15bd2034 f2fs: fix memory leak after kobject init failed in fill_super
If we fail to init and add the kobject in fill_super, the stats info and proc
object of f2fs will not be released.
We should free them before we finish fill_super.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-11-08 14:10:29 +09:00
Changman Lee
fb51b5ef9c f2fs: cleanup waiting routine for writeback pages in cp
Use the general method supported by the kernel.

 o changes from v1
   If any waiter exists at end io, wake it up.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-11-08 14:10:29 +09:00
Eric Sandeen
f275411440 ext4: remove unreachable code after ext4_can_extents_be_merged()
Commit ec22ba8e ("ext4: disable merging of uninitialized extents")
ensured that if either extent under consideration is uninit, we
decline to merge, and ext4_can_extents_be_merged() returns false.

So there is no need for the caller to then test whether the
extent under consideration is uninitialized; if it were, we
wouldn't have gotten that far.

The comments were also inaccurate; ext4_can_extents_be_merged()
no longer XORs the states, it fails if *either* is uninit.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-11-07 22:22:08 -05:00
Linus Torvalds
8efdf2b759 Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS updates from Steve French:
 "Includes a couple of fixes, plus changes to make multiplex identifiers
  easier to read and correlate with network traces, and a set of
  enhancements for SMB3 dialect.  Also adds support for per-file
  compression for both cifs and smb2/smb3 ("chattr +c filename").

  Should have at least one other merge request ready by next week with
  some new SMB3 security features and copy offload support"

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  Query network adapter info at mount time for debugging
  Fix unused variable warning when CIFS POSIX disabled
  Allow setting per-file compression via CIFS protocol
  Query File System Alignment
  Query device characteristics at mount time from server on SMB2/3 not just on cifs mounts
  cifs: Send a logoff request before removing a smb session
  cifs: Make big endian multiplex ID sequences monotonic on the wire
  cifs: Remove redundant multiplex identifier check from check_smb_hdr()
  Query file system attributes from server on SMB2, not just cifs, mounts
  Allow setting per-file compression via SMB2/3
  Fix corrupt SMB2 ioctl requests
2013-11-08 06:01:47 +09:00
Linus Torvalds
c224b76b56 Merge tag 'nfs-for-3.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
 "Highlights include:

   - Changes to the RPC socket code to allow NFSv4 to turn off
     timeout+retry:
      * Detect TCP connection breakage through the "keepalive" mechanism
   - Add client side support for NFSv4.x migration (Chuck Lever)
   - Add support for multiple security flavour arguments to the "sec="
     mount option (Dros Adamson)
   - fs-cache bugfixes from David Howells:
     * Fix an issue whereby caching can be enabled on a file that is
       open for writing
   - More NFSv4 open code stable bugfixes
   - Various Labeled NFS (selinux) bugfixes, including one stable fix
   - Fix buffer overflow checking in the RPCSEC_GSS upcall encoding"

* tag 'nfs-for-3.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
  NFSv4.2: Remove redundant checks in nfs_setsecurity+nfs4_label_init_security
  NFSv4: Sanity check the server reply in _nfs4_server_capabilities
  NFSv4.2: encode_readdir - only ask for labels when doing readdirplus
  nfs: set security label when revalidating inode
  NFSv4.2: Fix a mismatch between Linux labeled NFS and the NFSv4.2 spec
  NFS: Fix a missing initialisation when reading the SELinux label
  nfs: fix oops when trying to set SELinux label
  nfs: fix inverted test for delegation in nfs4_reclaim_open_state
  SUNRPC: Cleanup xs_destroy()
  SUNRPC: close a rare race in xs_tcp_setup_socket.
  SUNRPC: remove duplicated include from clnt.c
  nfs: use IS_ROOT not DCACHE_DISCONNECTED
  SUNRPC: Fix buffer overflow checking in gss_encode_v0_msg/gss_encode_v1_msg
  SUNRPC: gss_alloc_msg - choose _either_ a v0 message or a v1 message
  SUNRPC: remove an unnecessary if statement
  nfs: Use PTR_ERR_OR_ZERO in 'nfs/nfs4super.c'
  nfs: Use PTR_ERR_OR_ZERO in 'nfs41_callback_up' function
  nfs: Remove useless 'error' assignment
  sunrpc: comment typo fix
  SUNRPC: Add correct rcu_dereference annotation in rpc_clnt_set_transport
  ...
2013-11-08 05:57:46 +09:00
Rob Herring
b5480950c6 Merge remote-tracking branch 'grant/devicetree/next' into for-next
2013-11-07 10:34:46 -06:00
Linus Torvalds
a1212d278c Revert "sysfs: drop kobj_ns_type handling"
This reverts commit cb26a31157.

It mysteriously causes NetworkManager to not find the wireless device
for me.  As far as I can tell, Tejun *meant* for this commit to not make
any semantic changes, but there clearly are some.  So revert it, taking
into account some of the calling convention changes that happened in
this area in subsequent commits.

Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-07 20:47:28 +09:00
Linus Torvalds
0324e74534 Merge tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / sysfs patches from Greg KH:
 "Here's the big driver core / sysfs update for 3.13-rc1.

  There's lots of dev_groups updates for different subsystems, as they
  all get slowly migrated over to the safe versions of the attribute
  groups (removing userspace races with the creation of the sysfs
  files.) Also in here are some kobject updates, devres expansions, and
  the first round of Tejun's sysfs reworking to enable it to be used by
  other subsystems as a backend for an in-kernel filesystem.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits)
  sysfs: rename sysfs_assoc_lock and explain what it's about
  sysfs: use generic_file_llseek() for sysfs_file_operations
  sysfs: return correct error code on unimplemented mmap()
  mdio_bus: convert bus code to use dev_groups
  device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name
  sysfs: separate out dup filename warning into a separate function
  sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c
  sysfs: remove unused sysfs_get_dentry() prototype
  sysfs: honor bin_attr.attr.ignore_lockdep
  sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr
  devres: restore zeroing behavior of devres_alloc()
  sysfs: fix sysfs_write_file for bin file
  input: gameport: convert bus code to use dev_groups
  input: serio: remove bus usage of dev_attrs
  input: serio: use DEVICE_ATTR_RO()
  i2o: convert bus code to use dev_groups
  memstick: convert bus code to use dev_groups
  tifm: convert bus code to use dev_groups
  virtio: convert bus code to use dev_groups
  ipack: convert bus code to use dev_groups
  ...
2013-11-07 11:42:15 +09:00
Gu Zheng
359d992bcd xfs: simplify kmem_{zone_}zalloc
Introduce flag KM_ZERO which is used to alloc zeroed entry, and convert
kmem_{zone_}zalloc to call kmem_{zone_}alloc() with KM_ZERO directly,
in order to avoid the setting to zero step. 
And following Dave's suggestion, make kmem_{zone_}zalloc static inline
into kmem.h as they're now just a simple wrapper.
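
The shape of this change can be sketched in user space: the zeroing variant becomes a thin inline wrapper that passes a flag to the general allocator, which then returns zeroed memory directly. The names `SK_ZERO`, `sk_alloc` and `sk_zalloc` are hypothetical, and `calloc`/`malloc` stand in for the kernel allocators.

```c
#include <stdlib.h>

#define SK_ZERO 0x1u   /* hypothetical flag, standing in for KM_ZERO */

static void *sk_alloc(size_t size, unsigned flags)
{
    /* Let the allocator hand back zeroed memory when asked, instead of
     * allocating first and clearing in a separate step. */
    if (flags & SK_ZERO)
        return calloc(1, size);
    return malloc(size);
}

/* The zeroing variant is now just a wrapper that adds the flag. */
static inline void *sk_zalloc(size_t size, unsigned flags)
{
    return sk_alloc(size, flags | SK_ZERO);
}
```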

V2:
  Make kmem_{zone_}zalloc static inline into kmem.h as Dave suggested.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-11-06 16:31:27 -06:00
Dave Chinner
d123031a56 xfs: add tracepoints to AGF/AGI read operations
To help track down AGI/AGF lock ordering issues, I added these
tracepoints to tell us when an AGI or AGF is read and locked.  With
these we can now determine if the lock ordering goes wrong from
tracing captures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-11-06 12:42:52 -06:00
Dave Chinner
750b9c9066 xfs: trace AIL manipulations
While debugging a log tail issue on a RHEL6 kernel, I added these trace
points to trace log items being added, moved and removed in the AIL
and how that affected the log tail LSN that was written to the log.
They were very helpful in that they immediately identified the cause
of the problem being seen. Hence I'd like to always have them
available for use.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-11-06 12:41:51 -06:00
John Stultz
1ca7d67cf5 seqcount: Add lockdep functionality to seqcount/seqlock structures
Currently seqlocks and seqcounts don't support lockdep.

After running across a seqcount related deadlock in the timekeeping
code, I used a less-refined and more focused variant of this patch
to narrow down the cause of the issue.

This is a first-pass attempt to properly enable lockdep functionality
on seqlocks and seqcounts.

Since seqcounts are used in the vdso gettimeofday code, I've provided
non-lockdep accessors for those needs.

I've also handled one case where there were nested seqlock writers
and there may be more edge cases.

Comments and feedback would be appreciated!

Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 12:40:26 +01:00
Chao Yu
3b03f72445 f2fs: avoid to use a NULL point in destroy_segment_manager
destroy_segment_manager must not dereference a NULL pointer after memory
allocation for f2fs_sm_info fails.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-11-06 16:37:44 +09:00
Ingo Molnar
c90423d1de Merge branch 'sched/core' into core/locking, to prepare the kernel/locking/ file move
Conflicts:
	kernel/Makefile

There are conflicts in kernel/Makefile due to file moving in the
scheduler tree - resolve them.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 07:50:37 +01:00
Richard Guy Briggs
9410d228a4 audit: call audit_bprm() only once to add AUDIT_EXECVE information
Move the audit_bprm() call from search_binary_handler() to exec_binprm().  This
allows us to get rid of the mm member of struct audit_aux_data_execve since
bprm->mm will equal current->mm.

This also mitigates the issue that ->argc could be modified by the
load_binary() call in search_binary_handler().

audit_bprm() was being called to add an AUDIT_EXECVE record to the audit
context every time search_binary_handler() was recursively called.  Only one
reference is necessary.

Reported-by: Oleg Nesterov <onestero@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
---
This patch is against 3.11, but was developed on Oleg's post-3.11 patches that
introduce exec_binprm().
2013-11-05 11:15:03 -05:00
Jeff Layton
14e972b451 audit: add child record before the create to handle case where create fails
Historically, when a syscall that creates a dentry fails, you get an audit
record that looks something like this (when trying to create a file named
"new" in "/tmp/tmp.SxiLnCcv63"):

    type=PATH msg=audit(1366128956.279:965): item=0 name="/tmp/tmp.SxiLnCcv63/new" inode=2138308 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023

This record makes no sense since it's associating the inode information for
"/tmp/tmp.SxiLnCcv63" with the path "/tmp/tmp.SxiLnCcv63/new". The recent
patch I posted to fix the audit_inode call in do_last fixes this, by making it
look more like this:

    type=PATH msg=audit(1366128765.989:13875): item=0 name="/tmp/tmp.DJ1O8V3e4f/" inode=141 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023

While this is more correct, if the creation of the file fails, then we
have no record of the filename that the user tried to create.

This patch adds a call to audit_inode_child to may_create. This creates
an AUDIT_TYPE_CHILD_CREATE record that will sit in place until the
create succeeds. When and if the create does succeed, then this record
will be updated with the correct inode info from the create.

This fixes what was broken in commit bfcec708.
Commit 79f6530c should also be backported to stable v3.7+.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-11-05 11:08:44 -05:00
Eric Paris
81407c84ac audit: allow unsetting the loginuid (with priv)
If a task has CAP_AUDIT_CONTROL, allow that task to unset its loginuid.
This would allow a child of that task to set its loginuid without
CAP_AUDIT_CONTROL.  Thus when launching a new login daemon, a
privileged helper would be able to unset the loginuid and then the
daemon, which may be malicious user facing, does not need priv to function
correctly.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-11-05 11:08:09 -05:00
Jan Kara
7ba3ec5749 ext2: Fix fs corruption in ext2_get_xip_mem()
Commit 8e3dffc651 "Ext2: mark inode dirty after the function
dquot_free_block_nodirty is called" unveiled a bug in __ext2_get_block()
called from ext2_get_xip_mem(). That function called ext2_get_block()
mistakenly asking it to map 0 blocks while 1 was intended. Before the
above mentioned commit things worked out fine by luck but after that commit
we started returning that we allocated 0 blocks while we in fact
allocated 1 block and thus allocation was looping until all blocks in
the filesystem were exhausted.

Fix the problem by properly asking for one block and also add assertion
in ext2_get_blocks() to catch similar problems.

Reported-and-tested-by: Andiry Xu <andiry.xu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-11-05 11:26:47 +01:00
Maxim Patlasov
ce128de626 fuse: writepages: protect secondary requests from fuse file release
All async fuse requests must be supplied with extra reference to a fuse
file.  This is necessary to ensure that the fuse file is not released until
all in-flight requests are completed.  Fuse secondary writeback requests
must obey this rule as well.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-11-05 10:11:29 +01:00
Maxim Patlasov
41b6e41fc6 fuse: writepages: update bdi writeout when deleting secondary request
BDI_WRITTEN counter is used to estimate bdi bandwidth.  It must be
incremented every time the bdi ends page writeback, no matter whether it
was fulfilled by an actual write or by discarding the request (e.g. due to
shrunk i_size).

Note that even before writepages patches, the case "Got truncated off
completely" was handled in fuse_send_writepage() by calling
fuse_writepage_finish() which updated BDI_WRITTEN unconditionally.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-11-05 10:11:28 +01:00
Maxim Patlasov
6eaf4782eb fuse: writepages: crop secondary requests
If writeback happens while fuse is in FUSE_NOWRITE condition, the request
will be queued but not processed immediately (see fuse_flush_writepages()).
Until FUSE_NOWRITE becomes relaxed, more writebacks can happen.  They will
be queued as "secondary" requests to that first ("primary") request.

Existing implementation crops only primary request.  This is not correct
because a subsequent extending write(2) may increase i_size and then
secondary requests won't be cropped properly.  The result would be stale
data written to the server to a file offset where zeros must be.

Similar problem may happen if secondary requests are attached to an
in-flight request that was already cropped.

The patch solves the issue by cropping all secondary requests in
fuse_writepage_end().  Thanks to Miklos for idea.
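
Cropping a request against the file size comes down to simple range arithmetic, sketched below. The helper `crop_len` is hypothetical and works on plain byte offsets; the actual fuse code operates on its own request structures and page counts.

```c
#include <stdint.h>

/* Hypothetical helper: how many of the `len` bytes starting at `offset`
 * still lie inside a file of `size` bytes. */
static uint64_t crop_len(uint64_t offset, uint64_t len, uint64_t size)
{
    if (offset >= size)
        return 0;                       /* truncated off completely */
    uint64_t avail = size - offset;
    return len < avail ? len : avail;   /* keep only the part before EOF */
}
```

For a 10000-byte file, a 4096-byte request at offset 8192 would be cropped to 1808 bytes, and one at offset 12288 would be dropped entirely. Applying the same computation to every queued secondary request, not only the primary one, is the idea the message describes.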

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-11-05 10:11:27 +01:00
Maxim Patlasov
f6011081f5 fuse: writepages: roll back changes if request not found
fuse_writepage_in_flight() returns false if it fails to find request with
given index in fi->writepages.  Then the caller proceeds with populating
data->orig_pages[] and incrementing req->num_pages.  Hence,
fuse_writepage_in_flight() must revert changes it made in request before
returning false.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-11-05 10:11:26 +01:00
J. Bruce Fields
b78800baee Revert "nfsd: remove_stid can be incorporated into nfs4_put_delegation"
This reverts commit 7ebe40f203.  We forgot
the nfs4_put_delegation call in fs/nfsd/nfs4callback.c which should not
be unhashing the stateid.  This led to warnings from the idr code when
we tried to remove ids twice.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-11-04 17:46:50 -05:00
Trond Myklebust
fab99ebe39 NFSv4.2: Remove redundant checks in nfs_setsecurity+nfs4_label_init_security
We already check for nfs_server_capable(inode, NFS_CAP_SECURITY_LABEL)
in nfs4_label_alloc()
We check the minor version in _nfs4_server_capabilities before setting
NFS_CAP_SECURITY_LABEL.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-04 16:42:52 -05:00
Trond Myklebust
b944dba31d NFSv4: Sanity check the server reply in _nfs4_server_capabilities
We don't want to be setting capabilities and/or requesting attributes
that are not appropriate for the NFSv4 minor version.

- Ensure that we clear the NFS_CAP_SECURITY_LABEL capability when appropriate
- Ensure that we limit the attribute bitmasks to the mounted_on_fileid
  attribute and less for NFSv4.0
- Ensure that we limit the attribute bitmasks to suppattr_exclcreat and
  less for NFSv4.1
- Ensure that we limit it to change_sec_label or less for NFSv4.2

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-04 16:42:52 -05:00
Trond Myklebust
d204c5d2b8 NFSv4.2: encode_readdir - only ask for labels when doing readdirplus
Currently, if the server is doing NFSv4.2 and supports labeled NFS, then
our on-the-wire READDIR request ends up asking for the label information,
which is then ignored unless we're doing readdirplus.
This patch ensures that READDIR doesn't ask the server for label information
at all unless the readdir->bitmask contains the FATTR4_WORD2_SECURITY_LABEL
attribute, and the readdir->plus flag is set.

While we're at it, optimise away the 3rd bitmap field if it is zero.
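
The two decisions described above can be sketched as small predicates: ask for the label only when both conditions hold, and send the third bitmap word only when it is nonzero. The names and the bit value `SK_WORD2_SECURITY_LABEL` are illustrative assumptions, not the NFS client's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_WORD2_SECURITY_LABEL (1u << 16)   /* illustrative bit value */

/* Word 2 of the request: ask for the label only when the supplied bitmask
 * contains it and this is a readdirplus-style request. */
static uint32_t readdir_word2(uint32_t bitmask_word2, bool plus)
{
    if (plus && (bitmask_word2 & SK_WORD2_SECURITY_LABEL))
        return SK_WORD2_SECURITY_LABEL;
    return 0;
}

/* Number of 32-bit bitmap words worth sending: drop the third if zero. */
static unsigned bitmap_words(const uint32_t bm[3])
{
    return bm[2] != 0 ? 3 : 2;
}
```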

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-04 16:42:51 -05:00
Jeff Layton
3da580aab9 nfs: set security label when revalidating inode
Currently, we fetch the security label when revalidating an inode's
attributes, but don't apply it. This is in contrast to the readdir()
codepath where we do apply label changes.

Cc: Dave Quigley <dpquigl@davequigley.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-04 16:42:38 -05:00
Dave Chinner
273203699f xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering
Removing an inode from the namespace involves removing the directory
entry and dropping the link count on the inode. Removing the
directory entry can result in locking an AGF (directory blocks were
freed) and removing a link count can result in placing the inode on
an unlinked list which results in locking an AGI.

The big problem here is that we have an ordering constraint on AGF
and AGI locking - inode allocation locks the AGI, then can allocate
a new extent for new inodes, locking the AGF after the AGI.
Similarly, freeing the inode removes the inode from the unlinked
list, requiring that we lock the AGI first, and then freeing the
inode can result in an inode chunk being freed and hence freeing
disk space requiring that we lock an AGF.

Hence the ordering that is imposed by other parts of the code is AGI
before AGF. This means we cannot remove the directory entry before
we drop the inode reference count and put it on the unlinked list as
this results in a lock order of AGF then AGI, and this can deadlock
against inode allocation and freeing. Therefore we must drop the
link counts before we remove the directory entry.

This is still safe from a transactional point of view - it is not
until we get to xfs_bmap_finish() that we have the possibility of
multiple transactions in this operation. Hence as long as we remove
the directory entry and drop the link count in the first transaction
of the remove operation, there are no transactional constraints on
the ordering here.

Change the ordering of the operations in the xfs_remove() function
to align the ordering of AGI and AGF locking to match that of the
rest of the code.
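
The ordering rule (AGI before AGF) can be illustrated with a toy rank checker: a lock may be taken only if its rank is higher than anything already held. This is a hypothetical sketch for explaining the constraint, not XFS code and not how lockdep is implemented.

```c
#include <stdbool.h>

/* Hypothetical ranks: AGI must be taken before AGF. */
enum lock_rank { RANK_NONE = 0, RANK_AGI = 1, RANK_AGF = 2 };

struct held_locks {
    enum lock_rank highest;   /* highest rank currently held */
};

/* Returns true if taking `rank` respects the ordering. */
static bool try_take(struct held_locks *h, enum lock_rank rank)
{
    if (rank <= h->highest)
        return false;         /* out of order: potential deadlock */
    h->highest = rank;
    return true;
}
```

Dropping the link count first (AGI) and then removing the directory entry (AGF) passes this check; the old order, AGF then AGI, is the one it rejects.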

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-11-04 13:18:48 -06:00
Benjamin LaHaise
13fd8a5dc3 Merge branch 'aio-fix' of http://evilpiepirate.org/git/linux-bcache
2013-11-04 13:45:25 -05:00