This script causes a kernel deadlock:
set -e
DEVICE=/dev/vg1/linear
lvchange -ay $DEVICE
mkfs.ext3 $DEVICE
mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
quotacheck -gu /mnt/test
umount /mnt/test
mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
quotaon /mnt/test
dmsetup suspend $DEVICE
setquota -u root 1 2 3 4 /mnt/test &
sleep 1
dmsetup resume $DEVICE
setquota acquired semaphore s_umount for read and then tried to perform a
transaction (and waits because the device is suspended). dmsetup resume tries
to acquire s_umount for write before resuming the device (and waits for
setquota).
Fix the deadlock by grabbing a thawed superblock for quota commands which need
it.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In quota code we need to find a superblock corresponding to a device and wait
for superblock to be unfrozen. However this waiting has to happen without
s_umount semaphore because that is required for superblock to thaw. So provide
a function in VFS for this to keep dances with s_umount where they belong.
[AV: implementation switched to saner variant]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup(). Fix this in dcache_init() and similar areas.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When recursing down the locks when traversing a tree/list in
get_next_positive_dentry() or get_next_positive_subdir() a lock can
change from being nested to being a parent which breaks lockdep. This
patch tells lockdep about what we did.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It looks to me like the two ASSERT()s in xfs_trans_add_item() really
want to do a compare (==) rather than assignment (=).
This patch changes it from the latter to the former.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 05293485a0)
It looks to me like the two ASSERT()s in xfs_trans_add_item() really
want to do a compare (==) rather than assignment (=).
This patch changes it from the latter to the former.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Ben Myers <bpm@sgi.com>
Two bugfixes in XFS for 3.3: one fix passes KMEM_SLEEP to kmem_realloc
instead of 0, and the other resolves a possible deadlock in xfs quotas.
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: use a normal shrinker for the dquot freelist
xfs: pass KM_SLEEP flag to kmem_realloc() in xlog_recover_add_to_cnt_trans()
From RFC 5661 2.10.6.1: "If the previous sequence ID was 0xFFFFFFFF,
then the next request for the slot MUST have the sequence ID set to
zero."
While we're there, delete some redundant comments.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Says Jens:
"Time to push off some of the pending items. I really wanted to wait
until we had the regression nailed, but alas it's not quite there yet.
But I'm very confident that it's "just" a missing expire on exit, so
fix from Tejun should be fairly trivial. I'm headed out for a week on
the slopes.
- Killing the barrier part of mtip32xx. It doesn't really support
barriers, and it doesn't need them (writes are fully ordered).
- A few fixes from Dan Carpenter, preventing overflows of integer
multiplication.
- A fixup for loop, fixing a previous commit that didn't quite solve
the partial read problem from Dave Young.
- A bio integer overflow fix from Kent Overstreet.
- Improvement/fix of the door "keep locked" part of the cdrom shared
code from Paolo Benzini.
- A few cfq fixes from Shaohua Li.
- A fix for bsg sysfs warning when removing a file it did not create
from Stanislaw Gruszka.
- Two fixes for floppy from Vivek, preventing a crash.
- A few block core fixes from Tejun. One killing the over-optimized
ioc exit path, cleaning that up nicely. Two others fixing an oops
on elevator switch, due to calling into the scheduler merge check
code without holding the queue lock."
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix lockdep warning on io_context release put_io_context()
relay: prevent integer overflow in relay_open()
loop: zero fill bio instead of return -EIO for partial read
bio: don't overflow in bio_get_nr_vecs()
floppy: Fix a crash during rmmod
floppy: Cleanup disk->queue before caling put_disk() if add_disk() was never called
cdrom: move shared static to cdrom_device_info
bsg: fix sysfs link remove warning
block: don't call elevator callbacks for plug merges
block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions
mtip32xx: removed the irrelevant argument of mtip_hw_submit_io() and the unused member of struct driver_data
block: strip out locking optimization in put_io_context()
cdrom: use copy_to_user() without the underscores
block: fix ioc locking warning
block: fix NULL icq_cache reference
block,cfq: change code order
Stop reusing dquots from the freelist when allocating new ones directly, and
implement a shrinker that actually follows the specifications for the
interface. The shrinker implementation is still highly suboptimal at this
point, but we can gradually work on it.
This also fixes an bug in the previous lock ordering, where we would take
the hash and dqlist locks inside of the freelist lock against the normal
lock ordering. This is only solvable by introducing the dispose list,
and thus not when using direct reclaim of unused dquots for new allocations.
As a side-effect the quota upper bound and used to free ratio values in
/proc/fs/xfs/xqm are set to 0 as these values don't make any sense in the
new world order.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 04da0c8196)
Stop reusing dquots from the freelist when allocating new ones directly, and
implement a shrinker that actually follows the specifications for the
interface. The shrinker implementation is still highly suboptimal at this
point, but we can gradually work on it.
This also fixes an bug in the previous lock ordering, where we would take
the hash and dqlist locks inside of the freelist lock against the normal
lock ordering. This is only solvable by introducing the dispose list,
and thus not when using direct reclaim of unused dquots for new allocations.
As a side-effect the quota upper bound and used to free ratio values in
/proc/fs/xfs/xqm are set to 0 as these values don't make any sense in the
new world order.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
To ensure that we don't just reuse the bad delegation when we attempt to
recover the nfs4_state that received the bad stateid error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
There were two places bio_get_nr_vecs() could overflow:
First, it did a left shift to convert from sectors to bytes immediately
before dividing by PAGE_SIZE. If PAGE_SIZE ever was less than 512 a great
many things would break, so dividing by PAGE_SIZE >> 9 is safe and will
generate smaller code too.
The nastier overflow was in the DIV_ROUND_UP() (that's what the code was
effectively doing, anyways). If n + d overflowed, the whole thing would
return 0 which breaks things rather effectively.
bio_get_nr_vecs() doesn't claim to give an exact value anyways, so the
DIV_ROUND_UP() is silly; we could do a straight divide except if a
device's queue_max_sectors was less than PAGE_SIZE we'd return 0. So we
just add 1; this should always be safe - things will break badly if
bio_get_nr_vecs() returns > BIO_MAX_PAGES (bio_alloc() will suddenly start
failing) but it's queue_max_segments that must guard against this, if
queue_max_sectors is preventing this from happen things are going to
explode on architectures with different PAGE_SIZE.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
standard_receive3 will check the validity of the response from the
server (via checkSMB). It'll pass the result of that check to handle_mid
which will dequeue it and mark it with a status of
MID_RESPONSE_MALFORMED if checkSMB returned an error. At that point,
standard_receive3 will also return an error, which will make the
demultiplex thread skip doing the callback for the mid.
This is wrong -- if we were able to identify the request and the
response is marked malformed, then we want the demultiplex thread to do
the callback. Fix this by making standard_receive3 return 0 in this
situation.
Cc: stable@vger.kernel.org
Reported-and-Tested-by: Mark Moseley <moseleymark@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
* git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix oops in session setup code for null user mounts
[CIFS] Update cifs Kconfig title to match removal of experimental dependency
cifs: fix printk format warnings
cifs: check offset in decode_ntlmssp_challenge()
cifs: NULL dereference on allocation failure
put_io_context() performed a complex trylock dancing to avoid
deferring ioc release to workqueue. It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.
While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization. Strip it out. If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function ext4_claim_inode() is only called by one function,
ext4_new_inode(), and by folding the functionality into
ext4_new_inode(), we can remove almost 50 lines of code, and put all
of the logic of allocating a new inode into a single place.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Network namespace is taken from request transport and passed as a part of
cb_process_state structure.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch makes nfs_clients_lock allocated per network namespace. All items it
protects are already network namespace aware.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch makes ID's infrastructure network namespace aware. This was done
mainly because of nfs_client_lock, which is desired to be per network
namespace, but protects NFS clients ID's.
NOTE: NFS client's net pointer have to be set prior to ID initialization,
proper assignment was moved.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch splits global list of NFS servers into per-net-ns array of lists.
This looks more strict and clearer.
BTW, this patch also makes "/proc/fs/nfsfs/volumes" content depends on /proc
mount owner pid namespace. See below for details.
NOTE: few words about how was /proc/fs/nfsfs/ entries content show per network
namespace done. This is a little bit tricky and not the best is could be. But
it's cheap (proper fix for /proc conteinerization is a hard nut to crack).
The idea is simple: take proper network namespace from pid namespace
child reaper nsproxy of /proc/ mount creator.
This actually means, that if there are 2 containers with different net
namespace sharing pid namespace, then read of /proc/fs/nfsfs/ entries will
always return content, taken from net namespace of pid namespace creator task
(and thus second namespace set wil be unvisible).
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch splits global list of NFS clients into per-net-ns array of lists.
This looks more strict and clearer.
BTW, this patch also makes "/proc/fs/nfsfs/servers" entry content depends on
/proc mount owner pid namespace. See below for details.
NOTE: few words about how was /proc/fs/nfsfs/ entries content show per network
namespace done. This is a little bit tricky and not the best is could be. But
it's cheap (proper fix for /proc conteinerization is a hard nut to crack).
The idea is simple: take proper network namespace from pid namespace
child reaper nsproxy of /proc/ mount creator.
This actually means, that if there are 2 containers with different net
namespace sharing pid namespace, then read of /proc/fs/nfsfs/ entries will
always return content, taken from net namespace of pid namespace creator task
(and thus second namespace set wil be unvisible).
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch removes the CONFIG_NFS_USE_NEW_IDMAPPER compile option.
First, the idmapper will attempt to map the id using /sbin/request-key
and nfsidmap. If this fails (if /etc/request-key.conf is not configured
properly) then the idmapper will call the legacy code to perform the
mapping. I left a comment stating where the legacy code begins to make
it easier for somebody to remove in the future.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
data_server_cache entries should only be treated as the same if the address
list hasn't changed.
A new entry will be made when an MDS changes an address list (as seen by
GETDEVINFO). The old entry will be freed once all references are gone.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch addresses printks that have some context to show that they are
from fs/nfs/, but for the sake of consistency now start with NFS:
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Messages like "Got error -10052 from the server on DESTROY_SESSION. Session
has been destroyed regardless" can be confusing to users who aren't very
familiar with NFS.
NOTE: This patch ignores any printks() that start by printing __func__ - that
will be in a separate patch.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I noticed that test_stateid() was always using the same stateid for open
and lock recovery. After poking around a bit, I discovered that it was
always testing with a delegation stateid (even if there was no
delegation present). I figured this wasn't correct, so now delegation
and open stateids are tested during open_expired() and lock stateids are
tested during lock_expired().
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This takes the guesswork out of what stateid to use. The caller is
expected to figure this out and pass in the correct one.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Setting the task name is done within setup_new_exec() by accessing
bprm->filename. However this happens after flush_old_exec().
This may result in a use after free bug, flush_old_exec() may
"complete" vfork_done, which will wake up the parent which in turn
may free the passed in filename.
To fix this add a new tcomm field in struct linux_binprm which
contains the now early generated task name until it is used.
Fixes this bug on s390:
Unable to handle kernel pointer dereference at virtual kernel address 0000000039768000
Process kworker/u:3 (pid: 245, task: 000000003a3dc840, ksp: 0000000039453818)
Krnl PSW : 0704000180000000 0000000000282e94 (setup_new_exec+0xa0/0x374)
Call Trace:
([<0000000000282e2c>] setup_new_exec+0x38/0x374)
[<00000000002dd12e>] load_elf_binary+0x402/0x1bf4
[<0000000000280a42>] search_binary_handler+0x38e/0x5bc
[<0000000000282b6c>] do_execve_common+0x410/0x514
[<0000000000282cb6>] do_execve+0x46/0x58
[<00000000005bce58>] kernel_execve+0x28/0x70
[<000000000014ba2e>] ____call_usermodehelper+0x102/0x140
[<00000000005bc8da>] kernel_thread_starter+0x6/0xc
[<00000000005bc8d4>] kernel_thread_starter+0x0/0xc
Last Breaking-Event-Address:
[<00000000002830f0>] setup_new_exec+0x2fc/0x374
Kernel panic - not syncing: Fatal exception: panic_on_oops
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix a regression in 16-bit Atmel NAND flash which was introduced in 3.1
- Fix breakage with MTD suspend caused by the API rework
- Fix a problem with resetting the MX28 BCH module
- A couple of other trivial fixes
* tag 'for-linus-3.3-20120204' of git://git.infradead.org/~dwmw2/mtd-3.3:
Revert "mtd: atmel_nand: optimize read/write buffer functions"
mtd: fix MTD suspend
jffs2: do not initialize variable unnecessarily
mtd: gpmi-nand bugfix: reset the BCH module when it is not MX23
mtd: nand: fix typo in comment
Commit bf118a342f (NFSv4: include bitmap
in nfsv4 get acl data) introduces the 'acl_scratch' page for the case
where we may need to decode multi-page data. However it fails to take
into account the fact that the variable may be NULL (for the case where
we're not doing multi-page decode), and it also attaches it to the
encoding xdr_stream rather than the decoding one.
The immediate result is an Oops in nfs4_xdr_enc_getacl due to the
call to page_address() with a NULL page pointer.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: stable@vger.kernel.org