Commit Graph

30831 Commits

Author SHA1 Message Date
Dan Carpenter
da0503aae0 epoll: remove unneeded variable in reverse_path_check()
We never use the length variable.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:38 -07:00
Steven Rostedt
02edc6fc4d epoll: comment the funky #ifdef
Looking for a bug in -rt, I stumbled across this code here from: commit
2dfa4eeab0 ("epoll keyed wakeups: teach epoll about hints coming with
the wakeup key"), specifically:

  #ifdef CONFIG_DEBUG_LOCK_ALLOC
  static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
                                      unsigned long events, int subclass)
  {
         unsigned long flags;

         spin_lock_irqsave_nested(&wqueue->lock, flags, subclass);
         wake_up_locked_poll(wqueue, events);
         spin_unlock_irqrestore(&wqueue->lock, flags);
  }
  #else
  static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
                                      unsigned long events, int subclass)
  {
         wake_up_poll(wqueue, events);
  }
  #endif

You change the function of ep_wake_up_nested() depending on whether
CONFIG_DEBUG_LOCK_ALLOC is set or not.  This looks awfully suspicious,
and there's no comment to explain why.  I initially thought that this
was trying to fool lockdep, and hiding a real bug.

Investigating it, I found the creation of wake_up_nested() (which no
longer exists) but was created for the sole purpose of epoll and its
strange wake ups, as explained in commit 0ccf831cbe ("lockdep:
annotate epoll")

Although the commit message says "annotate epoll" the change log is much
better at explaining what is happening than what is in the actual code.
Thus a comment is really necessary here.  And to save the time of other
developers from having to go trudging through the git logs trying to
figure out why this code exists.

I took parts of the change log and placed it into a comment above the
affected code.  This will make the description of what is happening more
visible to new developers that have to look at this code for the first
time.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:38 -07:00
Hans Verkuil
626cf23660 poll: add poll_requested_events() and poll_does_not_wait() functions
In some cases the poll() implementation in a driver has to do different
things depending on the events the caller wants to poll for.  An example
is when a driver needs to start a DMA engine if the caller polls for
POLLIN, but doesn't want to do that if POLLIN is not requested but instead
only POLLOUT or POLLPRI is requested.  This is something that can happen
in the video4linux subsystem among others.

Unfortunately, the current epoll/poll/select implementation doesn't
provide that information reliably.  The poll_table_struct does have it: it
has a key field with the event mask.  But once a poll() call matches one
or more bits of that mask any following poll() calls are passed a NULL
poll_table pointer.

Also, the eventpoll implementation always left the key field at ~0 instead
of using the requested events mask.

This was changed in eventpoll.c so the key field now contains the actual
events that should be polled for as set by the caller.

The solution to the NULL poll_table pointer is to set the qproc field to
NULL in poll_table once poll() matches the events, not the poll_table
pointer itself.  That way drivers can obtain the mask through a new
poll_requested_events inline.

The poll_table_struct can still be NULL since some kernel code calls it
internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h).  In
that case poll_requested_events() returns ~0 (i.e.  all events).

Very rarely drivers might want to know whether poll_wait will actually
wait.  If another earlier file descriptor in the set already matched the
events the caller wanted to wait for, then the kernel will return from the
select() call without waiting.  This might be useful information in order
to avoid doing expensive work.

A new helper function poll_does_not_wait() is added that drivers can use
to detect this situation.  This is now used in sock_poll_wait() in
include/net/sock.h.  This was the only place in the kernel that needed
this information.

Drivers should no longer access any of the poll_table internals, but use
the poll_requested_events() and poll_does_not_wait() access functions
instead.  In order to enforce that the poll_table fields are now prepended
with an underscore and a comment was added warning against using them
directly.

This required a change in unix_dgram_poll() in unix/af_unix.c which used
the key field to get the requested events.  It's been replaced by a call
to poll_requested_events().

For qproc it was especially important to change its name since the
behavior of that field changes with this patch since this function pointer
can now be NULL when that wasn't possible in the past.

Any driver accessing the qproc or key fields directly will now fail to compile.

Some notes regarding the correctness of this patch: the driver's poll()
function is called with a 'struct poll_table_struct *wait' argument.  This
pointer may or may not be NULL, drivers can never rely on it being one or
the other as that depends on whether or not an earlier file descriptor in
the select()'s fdset matched the requested events.

There are only three things a driver can do with the wait argument:

1) obtain the key field:

	events = wait ? wait->key : ~0;

   This will still work although it should be replaced with the new
   poll_requested_events() function (which does exactly the same).
   This will now even work better, since wait is no longer set to NULL
   unnecessarily.

2) use the qproc callback. This could be deadly since qproc can now be
   NULL. Renaming qproc should prevent this from happening. There are no
   kernel drivers that actually access this callback directly, BTW.

3) test whether wait == NULL to determine whether poll would return without
   waiting. This is no longer sufficient as the correct test is now
   wait == NULL || wait->_qproc == NULL.

   However, the worst that can happen here is a slight performance hit in
   the case where wait != NULL and wait->_qproc == NULL. In that case the
   driver will assume that poll_wait() will actually add the fd to the set
   of waiting file descriptors. Of course, poll_wait() will not do that
   since it tests for wait->_qproc. This will not break anything, though.

   There is only one place in the whole kernel where this happens
   (sock_poll_wait() in include/net/sock.h) and that code will be replaced
   by a call to poll_does_not_wait() in the next patch.

   Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
   actually waiting. The next file descriptor from the set might match the
   event mask and thus any possible waits will never happen.

Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Reviewed-by: Jonathan Corbet <corbet@lwn.net>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:38 -07:00
H Hartley Sweeten
9710a78e55 fs/notify/notification.c: make subsys_initcall function static
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:31 -07:00
Muthu Kumar
b502bd1152 magic.h: move some FS magic numbers into magic.h
- Move open-coded filesystem magic numbers into magic.h

- Rearrange magic.h so that the filesystem-related constants are grouped
  together.

Signed-off-by: Muthukumar R <muthur@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:31 -07:00
Steve French
c7ad42b52d [CIFS] Fix trivial sparse warning with asyn i/o patch
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2012-03-23 16:30:56 -05:00
Jeff Layton
d81625587f cifs: handle "sloppy" option appropriately
cifs.ko has historically been tolerant of options that it does not
recognize. This is not normal behavior for a filesystem however.
Usually, it should only do this if you mount with '-s', and autofs
generally passes -s to the mount command to allow this behavior.

This patch makes cifs handle the option "sloppy" appropriately. If it's
present in the options string, then the client will tolerate options
that it doesn't recognize. If it's not present then the client will
error out in the presence of options that it does not recognize and
throw an error message explaining why.

There is also a companion patch being proposed for mount.cifs to make it
append "sloppy" to the mount options when passed the '-s' flag. This also
should (obviously) be applied on top of Sachin's conversion to the
standard option parser.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-By: Sachin Prabhu <sprabhu@redhat.com>
2012-03-23 14:40:56 -04:00
Sachin Prabhu
8830d7e07a cifs: use standard token parser for mount options
Use the standard token parser instead of the long if condition to parse
cifs mount options.

This was first proposed by Scott Lovenberg
http://lists.samba.org/archive/linux-cifs-client/2010-May/006079.html

Mount options have been grouped together in terms of their input types.
Aliases for username, password, domain and credentials have been added.
The password parser has been modified to make it easier to read.

Since the patch was first proposed, the following bugs have been fixed
1) Allow blank 'pass' option to be passed by the cifs mount helper when
using sec=none.
2) Do not explicitly set vol->nullauth to 0. This causes a problem
when using sec=none while also using a username.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2012-03-23 14:40:56 -04:00
Jeff Layton
27ac5755ae cifs: remove /proc/fs/cifs/OplockEnabled
We've had a deprecation warning on this file for 2 releases now. Remove
it as promised for 3.4.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2012-03-23 14:40:56 -04:00
Jeff Layton
da82f7e755 cifs: convert cifs_iovec_write to use async writes
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:40:56 -04:00
Jeff Layton
597b027f69 cifs: call cifs_update_eof with i_lock held
cifs_update_eof has the potential to be racy if multiple threads are
trying to modify it at the same time. Protect modifications of the
server_eof value with the inode->i_lock.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2012-03-23 14:40:56 -04:00
Jeff Layton
e9492871fb cifs: abstract out function to marshal up the iovec array for async writes
We'll need to do something a bit different depending on the caller.
Abstract the code that marshals the page array into an iovec.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:40:56 -04:00
Jeff Layton
a7103b99e4 cifs: fix up get_numpages
Use DIV_ROUND_UP. Also, PAGE_SIZE is more appropriate here since these
aren't pagecache pages.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:40:56 -04:00
Jeff Layton
35ebb4155f cifs: make cifsFileInfo_get return the cifsFileInfo pointer
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:40:56 -04:00
Jeff Layton
e94f7ba124 cifs: fix allocation in cifs_write_allocate_pages
The gfp flags are currently set to __GPF_HIGHMEM, which doesn't allow
for any reclaim. Make this more resilient by or'ing that with
GFP_KERNEL. Also, get rid of the goto and unify the exit codepath.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:40:56 -04:00
Jeff Layton
c2e8764009 cifs: allow caller to specify completion op when allocating writedata
We'll need a different set of write completion ops when not writing out
of the pagecache.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:40:55 -04:00
Jeff Layton
fe5f5d2e90 cifs: add pid field to cifs_writedata
We'll need this to handle rwpidforward option correctly when we use
async writes in the aio_write op.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:40:55 -04:00
Jeff Layton
da472fc847 cifs: add new cifsiod_wq workqueue
...and convert existing cifs users of system_nrt_wq to use that instead.

Also, make it freezable, and set WQ_MEM_RECLAIM since we use it to
deal with write reply handling.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
2012-03-23 14:40:53 -04:00
Pavel Shilovsky
7c9421e1a9 CIFS: Change mid_q_entry structure fields
to be protocol-unspecific and big enough to keep both CIFS
and SMB2 values.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:28:03 -04:00
Pavel Shilovsky
243d04b6e6 CIFS: Expand CurrentMid field
While in CIFS/SMB we have 16 bit mid, in SMB2 it is 64 bit.
Convert the existing field to 64 bit and mask off higher bits
for CIFS/SMB.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:28:03 -04:00
Pavel Shilovsky
5ffef7bf1d CIFS: Separate protocol-specific code from cifs_readv_receive code
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:28:03 -04:00
Pavel Shilovsky
d4e4854fd1 CIFS: Separate protocol-specific code from demultiplex code
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:28:02 -04:00
Pavel Shilovsky
792af7b05b CIFS: Separate protocol-specific code from transport routines
that lets us use this functions for SMB2.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
2012-03-23 14:28:02 -04:00
Linus Torvalds
e57f146b28 Merge tag 'upstream-3.4-rc1' of git://git.infradead.org/linux-ubifs
Pull UBIFS changes from Artem Bityutskiy:
 - Improve error messages
 - Clean-up i_nlink management
 - Minor clean-ups

* tag 'upstream-3.4-rc1' of git://git.infradead.org/linux-ubifs:
  UBIFS: improve error messages
  UBIFS: kill CUR_MAX_KEY_LEN macro
  UBIFS: do not use inc_link when i_nlink is zero
  UBIFS: make the dbg_lock spinlock static
  UBIFS: increase dumps loglevel
  UBIFS: amend recovery debugging message
2012-03-23 09:27:40 -07:00
Linus Torvalds
6e55f8ed81 Merge tag 'pstore-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull one pstore patch from Tony Luck

* tag 'pstore-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  pstore: Introduce get_reason_str() to pstore
2012-03-23 09:24:07 -07:00
Linus Torvalds
49d99a2f9c Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
Pull XFS updates from Ben Myers:
 "Scalability improvements for dquots, log grant code cleanups, plus
  bugfixes and cleanups large and small"

Fix up various trivial conflicts that were due to some of the earlier
patches already having been integrated into v3.3 as bugfixes, and then
there were development patches on top of those.  Easily merged by just
taking the newer version from the pulled branch.

* 'for-linus' of git://oss.sgi.com/xfs/xfs: (45 commits)
  xfs: fallback to vmalloc for large buffers in xfs_getbmap
  xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get
  xfs: remove remaining scraps of struct xfs_iomap
  xfs: fix inode lookup race
  xfs: clean up minor sparse warnings
  xfs: remove the global xfs_Gqm structure
  xfs: remove the per-filesystem list of dquots
  xfs: use per-filesystem radix trees for dquot lookup
  xfs: per-filesystem dquot LRU lists
  xfs: use common code for quota statistics
  xfs: reimplement fdatasync support
  xfs: split in-core and on-disk inode log item fields
  xfs: make xfs_inode_item_size idempotent
  xfs: log timestamp updates
  xfs: log file size updates at I/O completion time
  xfs: log file size updates as part of unwritten extent conversion
  xfs: do not require an ioend for new EOF calculation
  xfs: use per-filesystem I/O completion workqueues
  quota: make Q_XQUOTASYNC a noop
  xfs: include reservations in quota reporting
  ...
2012-03-23 09:19:22 -07:00
Linus Torvalds
1c3ddfe5ab Merge git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French

* git://git.samba.org/sfrench/cifs-2.6:
  cifs: clean up ordering in exit_cifs
  cifs: clean up call to cifs_dfs_release_automount_timer()
  CIFS: Delete echo_retries module parm
  CIFS: Prepare credits code for a slot reservation
  CIFS: Make wait_for_free_request killable
  CIFS: Introduce credit-based flow control
  CIFS: Simplify inFlight logic
  cifs: fix issue mounting of DFS ROOT when redirecting from one domain controller to the next
  CIFS: Respect negotiated MaxMpxCount
  CIFS: Fix a spurious error in cifs_push_posix_locks
2012-03-23 09:07:15 -07:00
Linus Torvalds
f63d395d47 Merge tag 'nfs-for-3.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates for Linux 3.4 from Trond Myklebust:
 "New features include:
   - Add NFS client support for containers.

     This should enable most of the necessary functionality, including
     lockd support, and support for rpc.statd, NFSv4 idmapper and
     RPCSEC_GSS upcalls into the correct network namespace from which
     the mount system call was issued.

   - NFSv4 idmapper scalability improvements

     Base the idmapper cache on the keyring interface to allow
     concurrent access to idmapper entries.  Start the process of
     migrating users from the single-threaded daemon-based approach to
     the multi-threaded request-key based approach.

   - NFSv4.1 implementation id.

     Allows the NFSv4.1 client and server to mutually identify each
     other for logging and debugging purposes.

   - Support the 'vers=4.1' mount option for mounting NFSv4.1 instead of
     having to use the more counterintuitive 'vers=4,minorversion=1'.

   - SUNRPC tracepoints.

     Start the process of adding tracepoints in order to improve
     debugging of the RPC layer.

   - pNFS object layout support for autologin.

  Important bugfixes include:

   - Fix a bug in rpc_wake_up/rpc_wake_up_status that caused them to
     fail to wake up all tasks when applied to priority waitqueues.

   - Ensure that we handle read delegations correctly, when we try to
     truncate a file.

   - A number of fixes for NFSv4 state manager loops (mostly to do with
     delegation recovery)."

* tag 'nfs-for-3.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (224 commits)
  NFS: fix sb->s_id in nfs debug prints
  xprtrdma: Remove assumption that each segment is <= PAGE_SIZE
  xprtrdma: The transport should not bug-check when a dup reply is received
  pnfs-obj: autologin: Add support for protocol autologin
  NFS: Remove nfs4_setup_sequence from generic rename code
  NFS: Remove nfs4_setup_sequence from generic unlink code
  NFS: Remove nfs4_setup_sequence from generic read code
  NFS: Remove nfs4_setup_sequence from generic write code
  NFS: Fix more NFS debug related build warnings
  SUNRPC/LOCKD: Fix build warnings when CONFIG_SUNRPC_DEBUG is undefined
  nfs: non void functions must return a value
  SUNRPC: Kill compiler warning when RPC_DEBUG is unset
  SUNRPC/NFS: Add Kbuild dependencies for NFS_DEBUG/RPC_DEBUG
  NFS: Use cond_resched_lock() to reduce latencies in the commit scans
  NFSv4: It is not safe to dereference lsp->ls_state in release_lockowner
  NFS: ncommit count is being double decremented
  SUNRPC: We must not use list_for_each_entry_safe() in rpc_wake_up()
  Try using machine credentials for RENEW calls
  NFSv4.1: Fix a few issues in filelayout_commit_pagelist
  NFSv4.1: Clean ups and bugfixes for the pNFS read/writeback/commit code
  ...
2012-03-23 08:53:47 -07:00
Linus Torvalds
aab008db80 Merge tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm
Pull cleancache changes from Konrad Rzeszutek Wilk:
 "This has some patches for the cleancache API that should have been
  submitted a _long_ time ago.  They are basically cleanups:

   - rename of flush to invalidate

   - moving reporting of statistics into debugfs

   - use __read_mostly as necessary.

  Oh, and also the MAINTAINERS file change.  The files (except the
  MAINTAINERS file) have been in #linux-next for months now.  The late
  addition of MAINTAINERS file is a brain-fart on my side - didn't
  realize I needed that just until I was typing this up - and I based
  that patch on v3.3 - so the tree is on top of v3.3."

* tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
  MAINTAINERS: Adding cleancache API to the list.
  mm: cleancache: Use __read_mostly as appropiate.
  mm: cleancache: report statistics via debugfs instead of sysfs.
  mm: zcache/tmem/cleancache: s/flush/invalidate/
  mm: cleancache: s/flush/invalidate/
2012-03-22 19:52:47 -07:00
Linus Torvalds
f7493e5d9c vfs: tidy up sparse warnings in fs/namei.c
While doing the fs/namei.c cleanups, I ran sparse on it, and it pointed
out other large integers and a couple of cases of us using '0' instead
of the proper 'NULL'.

Sparse still doesn't understand some of the conditional locking going
on, but that's no excuse for not fixing up the trivial stuff.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 16:10:40 -07:00
Linus Torvalds
989412bbd2 vfs: tidy up fs/namei.c byte-repeat word constants
In commit commit 1de5b41cd3 ("fs/namei.c: fix warnings on 32-bit")
Andrew said that there must be a tidier way of doing this.

This is that tidier way.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 15:58:27 -07:00
Randy Dunlap
1f1e6e523e fs: fix kernel-doc warnings in dcache.c
Fix kernel-doc warnings in fs/dcache.c:

  Warning(fs/dcache.c:1743): No description found for parameter 'seqp'
  Warning(fs/dcache.c:1743): Excess function parameter 'seq' description in '__d_lookup_rcu'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 15:49:18 -07:00
Al Viro
f132c5be05 Fix full_name_hash() behaviour when length is a multiple of 8
We want it to match what hash_name() is doing, which means extra
multiply by 9 in this case...

Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 15:10:43 -07:00
Lucas De Marchi
4e474a00d7 sysctl: protect poll() in entries that may go away
Protect code accessing ctl_table by grabbing the header with grab_header()
and after releasing with sysctl_head_finish().  This is needed if poll()
is called in entries created by modules: currently only hostname and
domainname support poll(), but this bug may be triggered when/if modules
use it and if user called poll() in a file that doesn't support it.

Dave Jones reported the following when using a syscall fuzzer while
hibernating/resuming:

RIP: 0010:[<ffffffff81233e3e>]  [<ffffffff81233e3e>] proc_sys_poll+0x4e/0x90
RAX: 0000000000000145 RBX: ffff88020cab6940 RCX: 0000000000000000
RDX: ffffffff81233df0 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88020cab6940
[ ... ]
Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83
7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 <8b> 16
b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89

If an entry goes away while we are polling() it, ctl_table may not exist
anymore.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-03-22 14:46:56 -07:00
Dave Chinner
c999a223c2 xfs: introduce an allocation workqueue
We currently have significant issues with the amount of stack that
allocation in XFS uses, especially in the writeback path. We can
easily consume 4k of stack between mapping the page, manipulating
the bmap btree and allocating blocks from the free list. Not to
mention btree block readahead and other functionality that issues IO
in the allocation path.

As a result, we can no longer fit allocation in the writeback path
in the stack space provided on x86_64. To alleviate this problem,
introduce an allocation workqueue and move all allocations to a
seperate context. This can be easily added as an interposing layer
into xfs_alloc_vextent(), which takes a single argument structure
and does not return until the allocation is complete or has failed.

To do this, add a work structure and a completion to the allocation
args structure. This allows xfs_alloc_vextent to queue the args onto
the workqueue and wait for it to be completed by the worker. This
can be done completely transparently to the caller.

The worker function needs to ensure that it sets and clears the
PF_TRANS flag appropriately as it is being run in an active
transaction context. Work can also be queued in a memory reclaim
context, so a rescuer is needed for the workqueue.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-03-22 16:12:24 -05:00
Dave Chinner
1a1d772433 xfs: Fix open flag handling in open_by_handle code
Sparse identified some unsafe handling of open flags in the xfs open
by handle ioctl code. Update the code to use the correct access
macros to ensure that we handle the open flags correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-03-22 15:56:52 -05:00
Kamal Dasu
5575acc780 xfs: fix deadlock in xfs_rtfree_extent
To fix the deadlock caused by repeatedly calling xfs_rtfree_extent

 - removed xfs_ilock() and xfs_trans_ijoin() from xfs_rtfree_extent(),
   instead added asserts that the inode is locked and has an inode_item
   attached to it.
 - in xfs_bunmapi() when dealing with an inode with the rt flag
   call xfs_ilock() and xfs_trans_ijoin() so that the
   reference count is bumped on the inode and attached it to the
   transaction before calling into xfs_bmap_del_extent, similar to
   what we do in xfs_bmap_rtalloc.

Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-03-22 15:31:06 -05:00
Gerard Snitselaar
1c2ccc66bc fs: xfs: fix section mismatch in linux-next
xfs_qm_exit() is called in init_xfs_fs().

Signed-off-by: Gerard Snitselaar <dev@snitselaar.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2012-03-22 13:48:55 -05:00
Linus Torvalds
95211279c5 Merge branch 'akpm' (Andrew's patch-bomb)
Merge first batch of patches from Andrew Morton:
 "A few misc things and all the MM queue"

* emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits)
  memcg: avoid THP split in task migration
  thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
  memcg: clean up existing move charge code
  mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
  mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
  mm/memcontrol.c: s/stealed/stolen/
  memcg: fix performance of mem_cgroup_begin_update_page_stat()
  memcg: remove PCG_FILE_MAPPED
  memcg: use new logic for page stat accounting
  memcg: remove PCG_MOVE_LOCK flag from page_cgroup
  memcg: simplify move_account() check
  memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
  memcg: kill dead prev_priority stubs
  memcg: remove PCG_CACHE page_cgroup flag
  memcg: let css_get_next() rely upon rcu_read_lock()
  cgroup: revert ss_id_lock to spinlock
  idr: make idr_get_next() good for rcu_read_lock()
  memcg: remove unnecessary thp check in page stat accounting
  memcg: remove redundant returns
  memcg: enum lru_list lru
  ...
2012-03-22 09:04:48 -07:00
Alex Elder
3489b42a72 ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
In ceph_vxattrcb_file_layout(), there is a check to determine
whether a preferred PG should be formatted into the output buffer.
That check assumes that a preferred PG number of 0 indicates "no
preference," but that is wrong.  No preference is indicated by a
negative (specifically, -1) PG number.

In addition, if that condition yields true, the preferred value
is formatted into a sized buffer, but the size consumed by the
earlier snprintf() call is not accounted for, opening up the
possibilty of a buffer overrun.

Finally, in ceph_vxattrcb_dir_rctime() where the nanoseconds part of
the time displayed did not include leading 0's, which led to
erroneous (sub-second portion of) time values being shown.

This fixes these three issues:
    http://tracker.newdream.net/issues/2155
    http://tracker.newdream.net/issues/2156
    http://tracker.newdream.net/issues/2157

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:52 -05:00
Alex Elder
cffaba15cd ceph: ensure Boolean options support both senses
Many ceph-related Boolean options offer the ability to both enable
and disable a feature.  For all those that don't offer this, add
a new option so that they do.

Note that ceph_show_options()--which reports mount options currently
in effect--only reports the option if it is different from the
default value.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
ee57741c52 rbd: make ceph_parse_options() return a pointer
ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful.  With this interface is not evident at call sites that
the pointer is always initialized.  Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder
18fa8b3fea ceph: make ceph_setxattr() and ceph_removexattr() more alike
This patch just rearranges a few bits of code to make more
portions of ceph_setxattr() and ceph_removexattr() identical.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
3ce6cd1233 ceph: avoid repeatedly computing the size of constant vxattr names
All names defined in the directory and file virtual extended
attribute tables are constant, and the size of each is known at
compile time.  So there's no need to compute their length every
time any file's attribute is listed.

Record the length of each string and use it when needed to determine
the space need to represent them.  In addition, compute the
aggregate size of strings in each table just once at initialization
time.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
aa4066ed7b ceph: encode type in vxattr callback routines
The names of the callback functions used for virtual extended
attributes are based only on the last component of the attribute
name.  Because of the way these are defined, this precludes allowing
a single (lowest) attribute name for different callbacks, dependent
on the type of file being operated on.  (For example, it might be
nice to support both "ceph.dir.layout" and "ceph.file.layout".)

Just change the callback names to avoid this problem.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
881a5fa200 ceph: drop "_cb" from name of struct ceph_vxattr_cb
A struct ceph_vxattr_cb does not represent a callback at all, but
rather a virtual extended attribute itself.  Drop the "_cb" suffix
from its name to reflect that.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
eb78808446 ceph: use macros to normalize vxattr table definitions
Entries in the ceph virtual extended attribute tables all follow a
distinct pattern in their definition.  Enforce this pattern through
the use of a macro.

Also, a null name field signals the end of the table, so make that
be the first field in the ceph_vxattr_cb structure.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
2289190719 ceph: use a symbolic name for "ceph." extended attribute namespace
Use symbolic constants to define the top-level prefix for "ceph."
extended attribute names.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
06476a69d8 ceph: pass inode rather than table to ceph_match_vxattr()
All callers of ceph_match_vxattr() determine what to pass as the
first argument by calling ceph_inode_vxattrs(inode).  Just do that
inside ceph_match_vxattr() itself, changing it to take an inode
rather than the vxattr pointer as its first argument.

Also ensure the function works correctly for an empty table (i.e.,
containing only a terminating null entry).

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Alex Elder
b829c1954d ceph: don't null-terminate xattr values
For some reason, ceph_setxattr() allocates an extra byte in which a
'\0' is stored past the end of an extended attribute value.  This is
not needed, and is potentially misleading, so get rid of it.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00