The Committed_AS field can underflow in certain situations:
> # while true; do cat /proc/meminfo | grep _AS; sleep 1; done | uniq -c
> 1 Committed_AS: 18446744073709323392 kB
> 11 Committed_AS: 18446744073709455488 kB
> 6 Committed_AS: 35136 kB
> 5 Committed_AS: 18446744073709454400 kB
> 7 Committed_AS: 35904 kB
> 3 Committed_AS: 18446744073709453248 kB
> 2 Committed_AS: 34752 kB
> 9 Committed_AS: 18446744073709453248 kB
> 8 Committed_AS: 34752 kB
> 3 Committed_AS: 18446744073709320960 kB
> 7 Committed_AS: 18446744073709454080 kB
> 3 Committed_AS: 18446744073709320960 kB
> 5 Committed_AS: 18446744073709454080 kB
> 6 Committed_AS: 18446744073709320960 kB
Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does
not check for underflow.
But a calculation proportional to NR_CPUS isn't a good one anyway. In
general, the possibility of lock contention is proportional to the number
of online cpus, not the theoretical maximum (NR_CPUS).
The current kernel has generic percpu-counter infrastructure; using it is
the right way. It simplifies the code, and percpu_counter_read_positive()
does not suffer from the underflow issue.
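A minimal sketch of the approach (not the actual patch; the counter name
here is illustrative, and the real counter is initialised with
percpu_counter_init() at boot):

    #include <linux/percpu_counter.h>

    static struct percpu_counter vm_committed;  /* hypothetical counter name */

    static unsigned long committed_as_kb(void)
    {
            /* percpu_counter_read_positive() clamps a transiently negative
             * sum to 0, so /proc/meminfo can never report an underflowed
             * value like 18446744073709323392 kB */
            return percpu_counter_read_positive(&vm_committed)
                            << (PAGE_SHIFT - 10);      /* pages -> kB */
    }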
Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org> [All kernel versions]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the problem introduced by commit 3bfacef412 (get rid of
special-casing the /sbin/loader on alpha): an OSF/1 ECOFF binary segfaults
when binfmt_aout is built as a module. That happens because the a.out
binary handler ends up at the top of the binfmt list due to late
registration, and the kernel attempts to execute the binary without the
preparatory work that must be done by binfmt_loader.
Fixed by changing the registration order of the default binfmt handlers
using list_add_tail() and introducing an insert_binfmt() function which
places a new handler at the top of the binfmt list. This might be
generally useful for installing arch-specific frontends for the default
handlers, or just for overriding them.
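A hedged sketch of the shape of this change (simplified; the upstream
code may differ in details): register_binfmt() keeps appending to the
tail so the default handlers retain their order, while insert_binfmt()
adds at the head so an arch-specific frontend is tried first.

    static LIST_HEAD(formats);
    static DEFINE_RWLOCK(binfmt_lock);

    static int __register_binfmt(struct linux_binfmt *fmt, int insert)
    {
            write_lock(&binfmt_lock);
            if (insert)
                    list_add(&fmt->lh, &formats);       /* head: tried first */
            else
                    list_add_tail(&fmt->lh, &formats);  /* tail: default order */
            write_unlock(&binfmt_lock);
            return 0;
    }

    #define register_binfmt(fmt)  __register_binfmt(fmt, 0)  /* default handlers */
    #define insert_binfmt(fmt)    __register_binfmt(fmt, 1)  /* frontends/overrides */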
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The intention of commit aae8679b0e
("pagemap: fix bug in add_to_pagemap, require aligned-length reads of
/proc/pid/pagemap") was to force reads of /proc/pid/pagemap to be a
multiple of 8 bytes, but it now allows a read of 0 bytes, which actually
puts some data into the user's buffer. According to POSIX, if count is
zero, read() should return zero and have no other results.
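A minimal sketch of the fix (a simplified fragment of the top of
pagemap_read(); PM_ENTRY_BYTES is the 8-byte per-entry size used there):

            if (!count)
                    return 0;       /* POSIX: zero-length read is a no-op */
            if (count % PM_ENTRY_BYTES)
                    return -EINVAL; /* keep requiring 8-byte-aligned reads */
            /* ... existing page-table walk fills buf and returns the
             *     number of bytes copied ... */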
Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Cc: Thomas Tuttle <ttuttle@google.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change page_mkwrite to allow implementations to return with the page
locked, and also change its callers (in page fault paths) to hold the
lock until the page is marked dirty. This allows the filesystem to have
full control of page dirtying events coming from the VM.
Rather than simply hold the page locked over the page_mkwrite call, we
call page_mkwrite with the page unlocked and allow implementations to
return with it locked, so filesystems can avoid lock-order-reversal (LOR)
conditions with the page lock.
The problem with the current scheme is this: a filesystem that wants to
associate some metadata with a page as long as the page is dirty, will
perform this manipulation in its ->page_mkwrite. It currently then must
return with the page unlocked and may not hold any other locks (according
to existing page_mkwrite convention).
In this window, the VM could write out the page, clearing page-dirty. The
filesystem has no good way to detect that a dirty pte is about to be
attached, so it will happily write out the page, at which point, the
filesystem may manipulate the metadata to reflect that the page is no
longer dirty.
It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail. The
filesystem may need to allocate some data structure, for example.
And the VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail, so we must not allow any window where the
page could be written to if page_mkwrite does fail.
This solution of holding the page locked over the 3 critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes out the races nicely, preventing page cleaning for writeout from
being initiated in that window. This provides the filesystem with strong
synchronisation against the VM here.
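A hedged sketch of the fault-path ordering this establishes (heavily
simplified; VM_FAULT_LOCKED follows the page-fault API of this era, the
surrounding variables and the out label are illustrative):

    ret = vma->vm_ops->page_mkwrite(vma, &vmf);   /* called with page unlocked */
    if (!(ret & VM_FAULT_LOCKED))
            lock_page(page);                      /* old-style fs: lock it ourselves */
    if (!page->mapping) {                         /* lost a race with truncate */
            unlock_page(page);
            goto out;
    }
    /* pte dirty and page dirty are both set while the page lock is held,
     * so writeout cannot clean the page in between */
    entry = pte_mkdirty(entry);
    set_page_dirty(page);
    unlock_page(page);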
- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
under dirty pages might be susceptible to i_size changing under a partial
page at the end of the file (we also have a buffer.c-side problem here,
but it cannot be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
page_mkwrite functions themselves.
[ This also moves page_mkwrite another step closer to fault, which should
eventually allow page_mkwrite to be moved into ->fault, thus avoiding a
filesystem calldown and page lock/unlock cycle in __do_fault. ]
[akpm@linux-foundation.org: fix derefs of NULL ->mapping]
Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Carl Henrik Lunde reported and debugged this; the test for the
last allocated block was comparing bytes to blocks in this test:
    if (logical + length - 1 == EXT_MAX_BLOCK ||
        ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK)
            flags |= FIEMAP_EXTENT_LAST;
so any extent which ended right at 4G was stopping the extent
walk. Just replacing these values with the extent block &
length should fix it.
Also give blksize_bits a saner type, and reverse the order
of the tests to make the more likely case tested first.
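A hedged sketch of the corrected test (variable names illustrative; blk
and len here stand for the extent's start block and length in blocks, so
both comparisons are in block units, with the reversed order described
above):

    if (ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK ||
        blk + len - 1 == EXT_MAX_BLOCK)
            flags |= FIEMAP_EXTENT_LAST;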
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Tested-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The fiemap and get_blk_size ioctls should be enabled even for
directories, so move them outside file_ioctl().
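A hedged sketch of the move (simplified): handle these commands in
do_vfs_ioctl(), which runs for any file type, rather than in
file_ioctl(), which is only reached for regular files.

    /* in do_vfs_ioctl(), before the regular-file-only file_ioctl() fallback */
    switch (cmd) {
    case FIGETBSZ:
            return put_user(inode->i_sb->s_blocksize, (int __user *)arg);
    case FS_IOC_FIEMAP:
            return ioctl_fiemap(filp, arg);
    /* ... remaining cases unchanged ... */
    }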
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In memory-constrained systems with many partitions, the ~68K for each
partition for the mb_history buffer can be excessive.
This patch adds a new mount option, mb_history_length, as well as a
way of setting the default via a module parameter (or via a sysfs
parameter in /sys/module/ext4/parameters/default_mb_history_length).
If the mb_history_length is set to zero, the mb_history facility is
disabled entirely.
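A hedged sketch of how the module-level default might be declared (the
parameter name follows the text above; the default value and permissions
are illustrative):

    static unsigned int default_mb_history_length = 1000;

    module_param_named(default_mb_history_length, default_mb_history_length,
                       uint, 0644);
    MODULE_PARM_DESC(default_mb_history_length,
                     "Default number of entries saved for mb_history");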
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Move this out of a local variable into the nfs4_delegation object in
preparation for making this an async rpc call (at which point we'll need
any state like this in a common object that's preserved across function
calls).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
There's no point in keeping this field around--it's always zero.
(Background: the protocol allows you to tell the client that the file is
about to be truncated, as an optimization to save the client from
writing back dirty pages that will just be discarded. We don't
implement this hint. If we do some day, adding this field back in will
be the least of the work involved.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The nfs4_cb_recall struct is used only in nfs4_delegation, so its
pointer to the containing delegation is unnecessary--we could just use
container_of().
But there's no real reason to have this a separate struct at all--just
move these fields to nfs4_delegation.
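For reference, a hedged sketch of the container_of() alternative
mentioned above (the embedded field name is assumed):

    /* given a callback's nfs4_cb_recall pointer, recover the enclosing
     * delegation without storing a back-pointer */
    struct nfs4_delegation *dp =
            container_of(cbr, struct nfs4_delegation, dl_recall);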
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
I want to use the name for a struct that actually does represent a
single callback.
(Actually, I've never been sure it helps to have a separate struct for
the callback information. Some day maybe those fields could just be
dumped into struct nfs4_client. I don't know.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The fs/ext4/namei.h header file had only a single function
declaration, and should never have been a standalone file. Move it
into ext4.h, where it should have been from the beginning.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There is no longer a reason for a separate ext4_sb.h header file, so
move it into ext4.h just to make life easier for developers to find
the relevant data structures and typedefs. This should also speed up
compiles slightly.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There is no longer a reason for a separate ext4_i.h header file, so
move it into ext4.h just to make life easier for developers to find
the relevant data structures and typedefs. This should also speed up
compiles slightly.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
By avoiding the use of not-yet-used block groups (i.e., block groups
with the BLOCK_UNINIT flag), mballoc had a tendency to create large
files with large non-contiguous gaps. In addition, avoiding the use of
new block groups had a tendency to push regular file data into the
first block group in a flex_bg group, which slows down the speed of
e2fsck pass 2, since it has a tendency to seek much more. For
example:
                 Before Patch                       After Patch
            Time in seconds                    Time in seconds
            Real /  User / Sys    MB/s         Real /  User / Sys    MB/s
Pass 1      8.52 /  2.21 / 0.46   20.43        8.84 /  4.97 / 1.11   19.68
Pass 2     21.16 /  1.02 / 1.86   11.30        6.54 /  1.77 / 1.78   36.39
Pass 3      0.01 /  0.00 / 0.00  139.00        0.01 /  0.01 / 0.00  128.90
Pass 4      0.16 /  0.15 / 0.00    0.00        0.17 /  0.17 / 0.00    0.00
Pass 5      2.52 /  1.99 / 0.09    0.79        2.31 /  1.78 / 0.06    0.86
Total      32.40 /  5.11 / 2.49   12.81       17.99 /  8.75 / 2.98   23.01
This was on a sample 80 gig root filesystem which was approximately
50% full. Note the improved e2fsck pass 2 performance, by over a
factor of 3, due to a decreased number of seeks. (The total amount of
I/O in pass 2 was unchanged; the layout of the directory blocks was
simply much better from e2fsck's perspective.)
Other changes as a result of this patch on this sample filesystem:
                              Before Patch   After Patch
# of non-contig files              762            779
# of non-contig directories        571            570
# of BLOCK_UNINIT bg's             307            293
# of INODE_UNINIT bg's             503            503
Out of 640 block groups, of which 333 were in use, this patch caused
an extra 14 block groups to be utilized. The number of non-contiguous
files did go up slightly, but when measured against the 99.9% of the
files (603,154) which were contiguously allocated, this is pretty
insignificant.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andreas Dilger <adilger@sun.com>
When mounting multiple times from the same client to the same server with
different userids, we create a vcnum which should be unique if possible
(this is not the same as the smb uid, which is the handle to the security
context). We were not properly endian-converting the additional vcnums
(beyond the first, which is zero).
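A hedged sketch of the kind of fix this implies (the vcnum helper name is
illustrative): the VcNumber field is little-endian on the wire, so the
host-order value must be converted before it is placed in the request.

    /* vcnum is a host-order u16; the session setup request field is __le16 */
    pSMB->req.VcNumber = cpu_to_le16(get_next_vcnum(ses));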
CC: Stable <stable@kernel.org>
Acked-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Removes two sparse CHECK_ENDIAN warnings from Jeff's earlier patch, and
removes the dead readlink code (after noting where in findfirst we will
need to add something like it in the future, to handle the newly
discovered unexpected error on FindFirst of NTFS symlinks).
Signed-off-by: Steve French <sfrench@us.ibm.com>
The earlier patch to move this code to use the new unicode helpers
assumed that the filename strings would be null terminated. That's not
always the case.
Instead of passing "max_len" to the string converter, pass "min(len,
max_len)", which makes it do the right thing while still keeping the
parser confined to the response. Also fix up the prototypes of this
function and the callers so that max_len is unsigned (like len is).
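A hedged sketch of the call-site change (the converter name follows a
later patch in this series; the destination arguments are illustrative):
bound the conversion by the smaller of the on-the-wire name length and
the space left in the response.

    /* len is the wire length of the name; max_len is what remains in the
     * response buffer, so never let the converter walk past either */
    dst_len = cifs_from_ucs2(dst, (__le16 *)src, dst_size,
                             min_t(unsigned int, len, max_len),
                             nls_codepage, false);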
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The ocfs2 directory index updates two blocks when we remove an entry -
the dx root and the dx leaf. OCFS2_DELETE_INODE_CREDITS was only
accounting for the dx leaf. This shows up when ocfs2_delete_inode()
runs out of credits in jbd2_journal_dirty_metadata() at
"J_ASSERT_JH(jh, handle->h_buffer_credits > 0);".
The test that caught this was running dirop_file_racer from the
ocfs2-test suite with a 250-character filename PREFIX. Run on a 512B
blocksize, it forces the orphan dir index to grow large enough to
trigger.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This is necessary to avoid the conflict of syscall numbers.
Conflicts:
arch/x86/ia32/ia32entry.S
arch/x86/include/asm/unistd_32.h
arch/x86/include/asm/unistd_64.h
Fixes up the borked syscall numbers of perfcounters versus
preadv/pwritev as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
configfs_depend_item() recursively locks all inode mutexes from the
configfs root down to the target item, which makes lockdep unhappy. The
purpose of this recursive locking is to ensure that the item tree can be
safely parsed and that the target item, if found, is not about to leave.
This patch reworks configfs_depend_item() locking using configfs_dirent_lock.
Since configfs_dirent_lock protects all changes to the configfs_dirent tree, and
protects tagging of items to be removed, this lock can be used instead of the
inodes mutex lock chain.
This requires that the check for dependents be done atomically with the
CONFIGFS_USET_DROPPING tagging.
Now lockdep looks happy with configfs.
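A hedged sketch of the new scheme (the dirent-walk helper is
hypothetical; the flag and refcount fields are the existing ones): the
tree is walked and the dependent pinned under configfs_dirent_lock, so no
i_mutex chain is needed and the check is atomic with the DROPPING tag.

    spin_lock(&configfs_dirent_lock);
    sd = configfs_find_dirent(root_sd, target);      /* hypothetical walk helper */
    if (!sd || (sd->s_type & CONFIGFS_USET_DROPPING))
            ret = -ENOENT;                           /* target gone or going away */
    else
            sd->s_dependent_count++;                 /* pinned; dropped on undepend */
    spin_unlock(&configfs_dirent_lock);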
[ Lifted the setting of s_type into configfs_new_dirent() to satisfy the
atomic setting of CONFIGFS_USET_CREATING -- Joel ]
Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When attaching default groups (subdirs) of a new group (in mkdir() or
in configfs_register()), configfs recursively takes inode mutexes
along the path from the parent of the new group to the default
subdirs. This is needed to ensure that the VFS will not race with
operations on these sub-dirs. This is safe for the following reasons:
- the VFS allows locking an inode first and then one of its
children (the lock subclasses for this pattern are respectively
I_MUTEX_PARENT and I_MUTEX_CHILD);
- from this rule any inode path can be recursively locked in
descending order as long as it stays under a single mountpoint and
does not follow symlinks.
Unfortunately lockdep does not know (yet?) how to handle such
recursion.
I've tried to use Peter Zijlstra's lock_set_subclass() helper to
upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
that we might recursively lock some of their descendants, but this
usage does not seem to fit the purpose of lock_set_subclass() because
it leads to several i_mutexes being locked with subclass I_MUTEX_PARENT
by the same task.
From inside configfs it is not possible to serialize that recursive
locking with a top-level lock, because mkdir() and rmdir() are already
called with inodes locked by the VFS. So using some
mutex_lock_nest_lock() is not an option.
I am proposing two solutions:
1) one that wraps recursive mutex_lock()s with
lockdep_off()/lockdep_on().
2) (as suggested earlier by Peter Zijlstra) one that puts the
i_mutexes recursively locked in different classes based on their
depth from the top-level config_group created. This
induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
nesting of configfs default groups whenever lockdep is activated
but this limit looks reasonably high. Unfortunately, this also
isolates VFS operations on configfs default groups from the others
and thus lowers the chances to detect locking issues.
Nobody likes solution 1), which I can understand.
This patch implements solution 2). However lockdep is still not happy with
configfs_depend_item(). Next patch reworks the locking of
configfs_depend_item() and finally makes lockdep happy.
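A hedged sketch of solution 2) (the helper name is illustrative): each
level of default groups locks its i_mutex with a lockdep subclass derived
from its depth, via mutex_lock_nested(), so lockdep never sees the same
class taken recursively.

    static void configfs_lock_default_group(struct inode *inode, int depth)
    {
            /* depth is the distance from the top-level config_group;
             * each level gets its own lockdep subclass */
            mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD + depth);
    }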
[ Note: This hides a few locking interactions with the VFS from lockdep.
That was my big concern, because we like lockdep's protection. However,
the current state always dumps a spurious warning. The locking is
correct, so I tell people to ignore the warning and that we'll keep
our eyes on the locking to make sure it stays correct. With this patch,
we eliminate the warning. We do lose some of the lockdep protections,
but this only means that we still have to keep our eyes on the locking.
We're going to do that anyway. -- Joel ]
Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In most cases, cifs_strndup is converting from Unicode (UCS-2 / UTF-16) to
the configured local code page for the Linux mount (usually UTF-8), so
Jeff suggested that, to make it clearer that cifs_strndup is doing a
conversion and not just a memory allocation and copy, we rename the
function to include "from_ucs" (i.e. Unicode).
Signed-off-by: Steve French <sfrench@us.ibm.com>
Added a loop check when mounting a DFS tree. The mount will fail with
ELOOP if the referral walks exceed the MAX_NESTED_LINKS count.
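A hedged sketch of the check (the counter variable name is illustrative):

    /* in the mount/referral retry loop */
    if (referral_walks_count > MAX_NESTED_LINKS) {
            cERROR(1, ("BUG: reached hardcoded limit of DFS referrals"));
            rc = -ELOOP;
            goto out;
    }
    referral_walks_count++;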
Signed-off-by: Igor Mammedov <niallain@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
With remote DFS root support in cifs_mount, we can now afford to pass it
a UNC that is remote.
Signed-off-by: Igor Mammedov <niallain@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Two years ago, when the session setup code in cifs was rewritten and moved
to fs/cifs/sess.c, we were asked to keep the old code for a release or so
(which could be reenabled at runtime) since it was such a large change and
because the asn (SPNEGO) and NTLMSSP code was not rewritten and needed to
be. This was useful to avoid regressions, but is long overdue to be removed.
Now that the Kerberos (asn/spnego) code is working in fs/cifs/sess.c,
and the NTLMSSP code has been moved (the NTLMSSP blob setup will be
rewritten in the next patch in this series), quite a bit of dead code in
fs/cifs/connect.c can now be removed.
This old code should have been removed last year, but the earlier krb5
patches did not move/remove the NTLMSSP code which we had asked to
be done first. Since no one else volunteered, I am doing it now.
It is extremely important that we continue to examine the documentation
for this area, to make sure our code stays up to date with changes since
Windows 2003.
Signed-off-by: Steve French <sfrench@us.ibm.com>
...and remove cifs_convertUCSpath. There are no more callers. Also add a
#define for the buffer used in the readdir path so that we don't have so
many magic numbers floating around.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Change CIFSSMBUnixQuerySymLink to use the new unicode helper functions.
Also change the calling conventions so that the allocation of the target
name buffer is done in CIFSSMBUnixQuerySymLink rather than by the caller.
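A hedged sketch of the new calling convention (the exact prototype may
differ): the callee allocates the target-name buffer and hands it back,
so callers just kfree() it when done.

    int CIFSSMBUnixQuerySymLink(const int xid, struct cifsTconInfo *tcon,
                                const unsigned char *searchName,
                                char **symlinkinfo,     /* allocated by callee */
                                const struct nls_table *nls_codepage);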
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
...and change decode_unicode_ssetup to be a void function. It never
returns an actual error anyway.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Rename cifs_strlcpy_to_host to cifs_strndup since that better describes
what this function really does. Then, convert it to use the new string
conversion and measurement functions that work in units of bytes rather
than wide chars.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Working in units of words means we do a lot of unnecessary conversion back
and forth. Standardize on bytes instead since that's more useful for
allocating buffers and such. Also, remove hostlen_fromUCS since the new
function has a similar purpose.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Add a replacement function for cifs_strtoUCS_le. cifs_from_ucs2
takes args for the source and destination length so that we can ensure
that the function is confined within the intended buffers.
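A hedged sketch of the new helper's shape (the exact signature may
differ): both lengths bound the conversion so it cannot run off either
the source or the destination buffer.

    int cifs_from_ucs2(char *to, const __le16 *from, int tolen, int fromlen,
                       const struct nls_table *codepage, bool mapchars);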
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
xfs_getbmap (or rather the formatters called by it) copy out the getbmap
structures under the ilock, which can deadlock against mmap. This has
been reported via bugzilla a while ago (#717) and has recently also
shown up via lockdep.
So allocate a temporary buffer to format the kernel getbmap structures
into and then copy them out after dropping the locks.
A little problem with this is that we limit the number of extents we
can copy out by the maximum allocation size, but I see no real way
around that.
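A hedged sketch of the approach (heavily simplified; the allocation size
and formatter details are illustrative): format into a kernel buffer
while the ilock is held, then copy to userspace only after dropping it.

    struct getbmapx *out;
    int i, error = 0;

    out = kmem_zalloc(bmv->bmv_count * sizeof(struct getbmapx), KM_MAYFAIL);
    if (!out)
            return XFS_ERROR(ENOMEM);

    xfs_ilock(ip, XFS_ILOCK_SHARED);
    /* ... fill out[0..nmaps-1] from the extent map, as before ... */
    xfs_iunlock(ip, XFS_ILOCK_SHARED);

    /* no XFS locks held here, so the copy-out may safely fault in
     * mmap'd pages backed by the same inode */
    for (i = 0; i < nmaps; i++) {
            error = formatter(&arg, &out[i]);
            if (error)
                    break;
    }
    kmem_free(out);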
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
- reshuffle various conditionals for data vs attr fork to make the code
more readable
- do fine-grained goto-based error handling
- exit early from conditionals instead of keeping a long else branch around
- allow kmem_alloc to fail
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
There have been reports where an xfs filesystem was randomly corrupted
with fsfuzzer, and xfs failed to handle it gracefully. This patch fixes a
couple of the reported problems by adding further checks to the
superblock validation routine.
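A hedged sketch of the kind of check added (the superblock fields are
real, the specific combination of bounds shown is illustrative):

    if (unlikely(sbp->sb_blocksize < XFS_MIN_BLOCKSIZE ||
                 sbp->sb_blocksize > XFS_MAX_BLOCKSIZE ||
                 sbp->sb_inodesize < XFS_DINODE_MIN_SIZE ||
                 sbp->sb_inodesize > XFS_DINODE_MAX_SIZE ||
                 sbp->sb_inopblock !=
                         howmany(sbp->sb_blocksize, sbp->sb_inodesize))) {
            xfs_fs_mount_cmn_err(flags, "SB sanity check failed");
            return XFS_ERROR(EFSCORRUPTED);
    }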
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>