Commit Graph

15761 Commits

Author SHA1 Message Date
Tejun Heo
c995905916 block: fix diskstats access
There are two variants of stat functions - ones prefixed with double
underbars which don't care about preemption and ones without which
disable preemption before manipulating per-cpu counters.  It's unclear
whether the underbarred ones assume that preemption is disabled on
entry as some callers don't do that.

This patch unifies diskstats access by implementing disk_stat_lock()
and disk_stat_unlock() which take care of both RCU (for partition
access) and preemption (for per-cpu counter access).  diskstats access
should always be enclosed between the two functions.  As such, there's
no need for the versions which disable preemption.  They're removed
and the double-underbar ones are renamed to drop the underbars.  As an
extra argument is added, there's no danger of using the old version
unconverted.

disk_stat_lock() uses get_cpu() and returns the cpu index, and all
diskstat functions which access per-cpu counters now have a @cpu
argument to help RT.
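
The lock/unlock bracket described above can be modelled in userspace
roughly as follows.  This is only a toy sketch: every sketch_* name is
hypothetical, preemption is simulated with a plain counter, and RCU is
not modelled at all.  It only illustrates the calling convention in
which the lock returns a cpu index that is then passed to the counter
helpers.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4

/* Toy stand-ins for kernel state; none of these are real kernel symbols. */
static int sketch_preempt_depth;   /* > 0 means "preemption disabled" */
static int sketch_current_cpu;     /* the CPU we pretend to run on    */

struct sketch_disk_stats {
    unsigned long ios[SKETCH_NR_CPUS];   /* one counter slot per CPU */
};

/* Enter the stats section: pin to the current CPU and return its index. */
int sketch_stat_lock(void)
{
    sketch_preempt_depth++;
    return sketch_current_cpu;
}

/* Leave the stats section; must pair with sketch_stat_lock(). */
void sketch_stat_unlock(void)
{
    assert(sketch_preempt_depth > 0);
    sketch_preempt_depth--;
}

/* Bump the counter slot of @cpu; only legal inside the locked section. */
void sketch_stat_inc(int cpu, struct sketch_disk_stats *stats)
{
    assert(sketch_preempt_depth > 0);
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    stats->ios[cpu]++;
}

/* Readers sum all per-CPU slots to obtain the total. */
unsigned long sketch_stat_read(const struct sketch_disk_stats *stats)
{
    unsigned long sum = 0;

    for (int i = 0; i < SKETCH_NR_CPUS; i++)
        sum += stats->ios[i];
    return sum;
}
```

A caller takes the cpu index once and can then update several counters
inside the same section, which mirrors the "collapses several
preemption ops into one" effect mentioned in the message.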

This change adds RCU or preemption operations at some places but also
collapses several preemption ops into one at others.  Overall, the
performance difference should be negligible as all involved ops are
very lightweight per-cpu ones.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:06 +02:00
Tejun Heo
e71bf0d0ee block: fix disk->part[] dereferencing race
disk->part[] is protected by its matching bdev's lock.  However,
non-critical accesses like collecting stats and printing out sysfs and
proc information used to be performed without any locking.  As
partitions can come and go dynamically, partitions can go away
underneath those non-critical accesses.  As some of those accesses are
writes, this theoretically can lead to silent corruption.

This patch fixes the race by using RCU for the partition array and dev
reference counter to hold partitions.

* Rename disk->part[] to disk->__part[] to make sure no one outside
  genhd layer proper accesses it directly.

* Use RCU for disk->__part[] dereferencing.

* Implement disk_{get|put}_part() which can be used to get and put
  partitions from gendisk respectively.

* Iterators are implemented to help iterate through all partitions
  safely.

* Functions which require RCU readlock are marked with _rcu suffix.

* Use disk_put_part() in __blkdev_put() instead of directly putting
  the contained kobject.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:06 +02:00
Tejun Heo
f331c0296f block: don't depend on consecutive minor space
* Implement disk_devt() and part_devt() and use them to directly
  access devt instead of computing it from ->major and ->first_minor.

  Note that all references to ->major and ->first_minor outside of
  the block layer are used to determine devt of the disk (the part0),
  and as ->major and ->first_minor will continue to represent devt for
  the disk, converting these users isn't strictly necessary.  However,
  convert them for consistency.

* Implement disk_max_parts() to avoid directly dereferencing
  genhd->minors.

* Update bdget_disk() such that it doesn't assume consecutive minor
  space.

* Move devt computation from register_disk() to add_disk() and make it
  the only one (all other usages use the initially determined value).

These changes clean up the code and will help disk->part dereference
fix and extended block device numbers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:05 +02:00
Tejun Heo
cf771cb5a7 block: make variable and argument names more consistent
In hd_struct, @partno is used to denote partition number and a number
of other places use @part to denote hd_struct.  Functions use @part
and @index instead.  This causes confusion and makes it difficult to
use consistent variable names for hd_struct.  Always use @partno if a
variable represents partition number.

Also, print out functions use @f or @part for seq_file argument.  Use
@seqf uniformly instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:05 +02:00
Tejun Heo
88e341261c block: update add_partition() error handling
d805dda4 tried to fix error case handling in add_partition() but had a
few problems.

* disk->part[] entry is set early and left dangling if operation
  fails.

* Once the device is initialized, the last put_device() is responsible for
  freeing all the resources.  The failure path freed part_stats and p
  regardless of put_device() causing double free.

* holders subdir holds reference to the disk device, so failure path
  should remove it to release resources properly which was missing.

This patch fixes the above problems and, while at it, moves the partition
slot busy check into add_partition() for completeness and inlines
holders subdirectory creation.  Using a separate function for it just
obfuscates the code.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Abdel Benamrouche <draconux@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:04 +02:00
Tejun Heo
ec2cdedf79 block: allow deleting zero length partition
delete_partition() was a noop for zero length partitions.  As the
addition code allows creating zero length partitions and deletion is
assumed to always succeed, this causes a memory leak for zero length
partitions.  Allow zero length partitions to end their meaningless
lives.

While at it, allow deleting zero length partitions via
BLKPG_DEL_PARTITION ioctl too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:04 +02:00
Mikulas Patocka
5df97b91b5 drop vmerge accounting
Remove hw_segments field from struct bio and struct request. Without virtual
merge accounting they have no purpose.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:03 +02:00
Mikulas Patocka
b8b3e16cfe block: drop virtual merging accounting
Remove virtual merge accounting.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:03 +02:00
David Woodhouse
8c540a96c1 Let the block device know when sectors can be discarded
[hirofumi@mail.parknet.co.jp: discard _after_ checking for corrupt chains]

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:01 +02:00
Steve French
b77d753c41 [CIFS] Check that last search entry resume key is valid
Jeff's recent patch to add a last_entry field in the search structure
to better construct resume keys did not validate that the server
sent us a plausible pointer to the last entry.  This adds that.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-10-08 19:13:46 +00:00
Trond Myklebust
d7fb120774 NFS: Don't use range_cyclic for data integrity syncs
It is more efficient to write linearly starting from the beginning of the
file.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:19:05 -04:00
Steve Dickson
8491945f11 NFS: Client mounts hang when exported directory do not exist
This patch fixes a regression that was introduced by the string based mounts.

nfs_mount() statically returns -EACCES for every error returned
by the remote mountd. This is incorrect because -EACCES is
a non-fatal error to the mount.nfs command. This error causes
mount.nfs to retry the mount even in the case when the exported
directory does not exist.

This patch maps the errors returned by the remote mountd into
valid errno values, exactly how it was done pre-string based
mounts. Returning the correct errno enables mount.nfs
to do the right thing.
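
A minimal sketch of such a status-to-errno mapping, assuming the
MNT3ERR_* status values defined in RFC 1813.  The table, the function
name and the errno chosen for each status are illustrative and are not
taken from the patch; in particular, mapping the server-fault status to
-EIO and falling back to -EACCES for unknown values are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One row per MOUNT status code (numeric values per RFC 1813). */
struct mnt_errmap {
    int status;   /* status returned by the remote mountd */
    int err;      /* negative errno handed back to the caller */
};

static const struct mnt_errmap mnt_errtbl[] = {
    {     0,  0             },   /* MNT3_OK */
    {     1, -EPERM         },   /* MNT3ERR_PERM */
    {     2, -ENOENT        },   /* MNT3ERR_NOENT */
    {     5, -EIO           },   /* MNT3ERR_IO */
    {    13, -EACCES        },   /* MNT3ERR_ACCES */
    {    20, -ENOTDIR       },   /* MNT3ERR_NOTDIR */
    {    22, -EINVAL        },   /* MNT3ERR_INVAL */
    {    63, -ENAMETOOLONG  },   /* MNT3ERR_NAMETOOLONG */
    { 10004, -EOPNOTSUPP    },   /* MNT3ERR_NOTSUPP */
    { 10006, -EIO           },   /* MNT3ERR_SERVERFAULT (assumed mapping) */
};

/* Translate a mountd status into a negative errno (0 on success). */
int mnt_status_to_errno(int status)
{
    assert(status >= 0);   /* wire status codes are unsigned */

    for (size_t i = 0; i < sizeof(mnt_errtbl) / sizeof(mnt_errtbl[0]); i++) {
        if (mnt_errtbl[i].status == status)
            return mnt_errtbl[i].err;
    }
    return -EACCES;        /* unknown status: assumed fallback */
}
```

With a mapping like this, a missing export surfaces as -ENOENT, so the
caller can fail immediately instead of retrying.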

Signed-off-by: Steve Dickson <steved@redhat.com>
[Trond.Myklebust@netapp.com: nfs_stat_to_errno() now correctly returns
 negative errors, so remove the sign change.]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:19:01 -04:00
J. Bruce Fields
ea31a4437c nfs: Fix misparsing of nfsv4 fs_locations attribute
The code incorrectly assumes here that the server name (or ip address)
is null-terminated.  This can cause referrals to fail in some cases.
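
As an illustration of the general problem (not the actual NFS code),
a length-counted string has to be copied with its explicit length and
terminated by hand before it can be passed to anything that expects a
C string.  The struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A counted string as it might arrive off the wire: len bytes, no NUL. */
struct counted_str {
    const char *data;
    size_t len;
};

/*
 * Copy @src into @dst (capacity @dstsize) and NUL-terminate the result.
 * Returns 0 on success, -1 if the string plus terminator does not fit.
 */
int counted_str_to_cstr(const struct counted_str *src, char *dst,
                        size_t dstsize)
{
    assert(src != NULL && dst != NULL);

    if (src->len >= dstsize)
        return -1;                 /* no room for the terminator */
    memcpy(dst, src->data, src->len);
    dst[src->len] = '\0';          /* never rely on the source for this */
    return 0;
}
```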

Also support ipv6 addresses.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:17:47 -04:00
J. Bruce Fields
f0c929251e nfs: prepare to share nfs_set_port
We plan to use this function elsewhere.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:17:36 -04:00
J. Bruce Fields
460cdbc832 nfs: replace while loop by for loops in nfs_follow_referral
Whoever wrote this had a bizarre allergy to for loops.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:17:20 -04:00
J. Bruce Fields
4ada29d5c4 nfs: break up nfs_follow_referral
This function is a little longer and more deeply nested than necessary.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:16:40 -04:00
EG Keizer
37ca8f5c60 nfs: authenticated deep mounting
Allow mount to do authenticated mounts below the root of the exported tree.
The wording in RFC 2623, sec 2.3.2. allows fsinfo with UNIX authentication
on the root of the export. Mounts are not always done on the root
of the exported tree. Automounts in particular often mount below the root of
the exported tree.
Some server implementations (justly) require full authentication for the
so-called deep mounts. The old code used AUTH_SYS only. This caused deep
mounts to fail on systems requiring stronger authentication.
The client should try both authentication types and use the first one that
succeeds.
This method was already partially implemented. This patch completes
the implementation for NFS2 and NFS3.
This patch was developed to allow Debian systems to automount home directories
on Solaris servers with krb5 authentication.

Tested on kernel 2.6.24-etchnhalf.1

Signed-off-by: E.G. Keizer <keie@few.vu.nl>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:16:22 -04:00
Jeff Layton
f25b874d39 NFS: missing nfs_fattr_init in nfs3_proc_getacl and nfs3_proc_setacls (resend #2)
The fattrs used in the NFSv3 getacl/setacl calls are not being properly
initialized. This occasionally causes nfs_update_inode to fall into
NFSv4 specific codepaths when handling post-op attrs from these calls.

Thanks to Cai Qian for noticing the spurious NFSv4 messages in debug
output from a v3 mount...

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:16:22 -04:00
J. Bruce Fields
f200c11c25 nfs: remove an obsolete nfs_flock comment
We *do* now allow bsd flocks over nfs.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:16:21 -04:00
Denis V. Lunev
44d5759d3f nfs: BUG_ON in nfs_follow_mountpoint
Unfortunately, BUG_ON(IS_ROOT(dentry)) can happen inside
nfs_follow_mountpoint with NFS running Fedora 8 using a
specific setup.
https://bugzilla.redhat.com/show_bug.cgi?id=458622

So, the situation should be handled on NFS client gracefully.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:15:16 -04:00
Denis V. Lunev
fd08d7e9d1 nfs: ERR_PTR is expected on failure from nfs_do_clone_mount
Replace NULL with ERR_PTR(-EINVAL).
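
For readers unfamiliar with the convention, the following is a
userspace re-creation of the kernel's error-pointer helpers, written
from memory as a sketch.  The do_clone_mount_sketch() function is
hypothetical and only shows why returning NULL is a problem: a caller
that tests IS_ERR() would treat NULL as a valid pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top page of the address space. */
void *ERR_PTR(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* Recover the errno from an error pointer. */
long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True only for pointers in the reserved error range; NULL is not one. */
int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int dummy_mount;   /* stands in for a real mount object */

/* Hypothetical callee: reports failure as an error pointer, never NULL. */
void *do_clone_mount_sketch(int ok)
{
    if (!ok)
        return ERR_PTR(-EINVAL);
    return &dummy_mount;
}
```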

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 18:14:34 -04:00
Adrian Bunk
bb8a3b53c2 fix fs/nfs/nfsroot.c compilation
This patch fixes the following compile error caused by
commit f9247273cb
(UFS: add const to parser token tabl):

<--  snip  -->

...
  CC      fs/nfs/nfsroot.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/nfs/nfsroot.c:130: error: tokens causes a section type conflict
make[3]: *** [fs/nfs/nfsroot.o] Error 1

<--  snip  -->

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:59:49 -04:00
Trond Myklebust
691beb13cd NFS: Allow concurrent inode revalidation
Currently, if two processes are both trying to revalidate metadata for the
same inode, they will find themselves being serialised. There is no good
justification for this now that we have improved our ability to detect
stale attribute data, so we should remove that serialisation.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:59:43 -04:00
Trond Myklebust
2f28ea614f NFS: Fix up nfs_setattr_update_inode()
Ensure that it sets the inode metadata under the correct spinlock.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:41:46 -04:00
Trond Myklebust
076f1fc94c NFS: Don't clear nfsi->cache_validity in nfs_check_inode_attributes()
If we're merely checking the inode attributes because we suspect that the
'updated' attributes returned by the RPC call are stale, then we shouldn't
be doing weak cache consistency updates or clearing the cache_validity
flags.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:41:33 -04:00
Trond Myklebust
4dc05efb86 NFS: Convert __nfs_revalidate_inode() to use nfs_refresh_inode()
In the case where there are parallel RPC calls to the same inode, we may
receive stale metadata due to the lack of ordering, hence the sanity
checking of metadata in nfs_refresh_inode().
Currently, __nfs_revalidate_inode() is calling nfs_update_inode() directly,
without any further sanity checks, and hence may end up setting the inode
up with stale metadata.

Fix is to use nfs_refresh_inode() instead of nfs_update_inode().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:41:17 -04:00
Trond Myklebust
d65f557f39 NFS: Fix nfs_post_op_update_inode_force_wcc()
If we believe that the attributes are old (see nfs_refresh_inode()), then
we shouldn't force an update.
Also ensure that we hold the inode->i_lock across attribute checks and the
call to nfs_refresh_inode_locked() to ensure that we don't race with other
attribute updates.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:41:00 -04:00
Trond Myklebust
a10ad17630 NFS: Fix the NFS attribute update
Currently nfs_refresh_inode() will only update the inode metadata if it
sees that the RPC call that returned the nfs_fattr was started
after the last update of the inode. This means that if we have parallel
RPC calls to the same inode (when sending WRITE calls, for instance), we
may often miss updates.

This patch attempts to recover those missed updates by also accepting
them if the ctime in the nfs_fattr is more recent than the inode's
cached ctime.
It also recovers the case where the file size has increased, but the
ctime has not been updated due to limited ctime resolution.
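
The acceptance rule described above can be sketched as a single
predicate.  The structs and field names are hypothetical simplifications
rather than the real nfs_fattr or inode layout, and the plain ">"
comparison on the RPC start time ignores the counter wraparound that
kernel code has to handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_time {            /* simplified timestamp */
    long long sec;
    long nsec;
};

struct cached_inode {           /* what the client currently believes */
    struct sketch_time ctime;
    uint64_t size;
    unsigned long last_update;  /* when the cache was last refreshed */
};

struct attr_snapshot {          /* attributes returned by one RPC */
    struct sketch_time ctime;
    uint64_t size;
    unsigned long rpc_start;    /* when that RPC was issued */
};

static int sketch_time_cmp(const struct sketch_time *a,
                           const struct sketch_time *b)
{
    if (a->sec != b->sec)
        return a->sec < b->sec ? -1 : 1;
    if (a->nsec != b->nsec)
        return a->nsec < b->nsec ? -1 : 1;
    return 0;
}

/* Should the cached inode be updated from this snapshot? */
bool attrs_look_newer(const struct cached_inode *inode,
                      const struct attr_snapshot *fattr)
{
    assert(inode != NULL && fattr != NULL);

    if (fattr->rpc_start > inode->last_update)
        return true;            /* original rule: RPC began after update */
    if (sketch_time_cmp(&fattr->ctime, &inode->ctime) > 0)
        return true;            /* recovered: server ctime is more recent */
    if (fattr->size > inode->size)
        return true;            /* recovered: file grew, ctime too coarse */
    return false;
}
```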

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:34:17 -04:00
Trond Myklebust
870a5be8b9 NFS: Clean up nfs_refresh_inode() and nfs_post_op_update_inode()
Try to avoid taking and dropping the inode->i_lock more than once. Do so by
moving the code in nfs_refresh_inode() that needs to be done under the
spinlock into a function nfs_refresh_inode_locked(), and then having both
nfs_refresh_inode() and nfs_post_op_update_inode() call it directly.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:29:49 -04:00
Trond Myklebust
7973c1f15a NFS: Add mount options for controlling the lookup cache
Add the following NFS-specific mount options to the parser.

    -o lookupcache=all          /* Default: cache positive & negative
                                   dentries */
    -o lookupcache=pos[itive]   /* Don't cache negative dentries */
    -o lookupcache=none         /* Strict revalidation of all dentries */
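
A rough sketch of how the three option values might translate into
flag bits.  The SKETCH_* flag values are placeholders, not the kernel's
real constants, and having "none" also set the no-negative-dentries bit
is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder bit values, chosen only for this example. */
#define SKETCH_LOOKUP_CACHE_NONEG 0x1u   /* don't cache negative dentries */
#define SKETCH_LOOKUP_CACHE_NONE  0x2u   /* strictly revalidate everything */

/*
 * Parse the value of "lookupcache=" and update *flags accordingly.
 * Returns 0 on success, -1 for an unrecognised value (flags untouched).
 */
int parse_lookupcache(const char *value, unsigned int *flags)
{
    assert(value != NULL && flags != NULL);

    unsigned int f = *flags &
        ~(SKETCH_LOOKUP_CACHE_NONEG | SKETCH_LOOKUP_CACHE_NONE);

    if (strcmp(value, "all") == 0) {
        /* default: both bits stay clear */
    } else if (strcmp(value, "pos") == 0 || strcmp(value, "positive") == 0) {
        f |= SKETCH_LOOKUP_CACHE_NONEG;
    } else if (strcmp(value, "none") == 0) {
        f |= SKETCH_LOOKUP_CACHE_NONEG | SKETCH_LOOKUP_CACHE_NONE;
    } else {
        return -1;
    }
    *flags = f;
    return 0;
}
```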

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:23:57 -04:00
Trond Myklebust
ff3525a539 NFS: Don't apply NFS_MOUNT_FLAGMASK to text-based mounts
The point of introducing text-based mounts was to allow us to add
functionality without having to worry about legacy binary mount formats.
The mask should be there in order to ensure that binary formats don't start
enabling features that they cannot support. There is no justification for
applying it to the text mount path.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:23:56 -04:00
Trond Myklebust
4eec952e42 NFS: Add options for finer control of the lookup cache
Add the flag NFS_MOUNT_LOOKUP_CACHE_NONEG to turn off the caching of
negative dentries. In reality what we do is to force
nfs_lookup_revalidate() to always discard negative dentries.

Add the flag NFS_MOUNT_LOOKUP_CACHE_NONE for enforcing stricter
revalidation of dentries. It forces the revalidate code to always do a
lookup instead of just checking the cached mtime of the parent directory.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-07 17:22:20 -04:00
Steve French
0752f1522a [CIFS] make sure we have the right resume info before calling CIFSFindNext
When we do a seekdir() or equivalent, we usually end up doing a
FindFirst call and then call FindNext until we get to the offset that we
want. The problem is that when we call FindNext, the code usually
doesn't have the proper info (mostly, the filename of the entry from the
last search) to resume the search.

Add a "last_entry" field to the cifs_search_info that points to the last
entry in the search. We calculate this pointer by using the
LastNameOffset field from the search parms that are returned. We then
use that info to do a cifs_save_resume_key before we call CIFSFindNext.

This patch allows CIFS to reliably pass the "telldir" connectathon test.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-10-07 20:03:33 +00:00
Steve French
6050247d80 [CIFS] clean up error handling in cifs_unlink
Currently, if a standard delete fails and we end up getting -EACCES
we try to clear ATTR_READONLY and try the delete again. If that
then fails with -ETXTBSY then we try a rename_pending_delete. We
aren't handling other errors appropriately though.

Another client could have deleted the file in the meantime and
we get back -ENOENT, for instance. In that case we wouldn't do a
d_drop. Instead of retrying in a separate call, simply goto the
original call and use the error handling from that.

Also, we weren't properly undoing any attribute changes that
were done before returning an error back to the caller.

CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-10-07 18:42:52 +00:00
Andi Kleen
39d80c33a0 ext4: Avoid double dirtying of super block in ext4_put_super()
While reading code I noticed that ext4_put_super() dirties the 
superblock bh twice. It is always done in ext4_commit_super()
too. Remove the redundant dirty operation.
Should be a nop semantically.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-10-06 21:37:44 -04:00
Eric Sandeen
6873fa0de1 Hook ext4 to the vfs fiemap interface.
ext4_ext_walk_space() was reinstated to be used for iterating over file
extents with a callback; it is used by the ext4 fiemap implementation.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
2008-10-07 00:46:36 -04:00
Trond Myklebust
1daef0a868 NFS: Clean up nfs_sb_active/nfs_sb_deactive
Instead of causing umount requests to block on server->active_wq while the
asynchronous sillyrename deletes are executing, we can use the sb->s_active
counter to obtain a reference to the super_block, and then release that
reference in nfs_async_unlink_release().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-06 20:08:26 -04:00
Trond Myklebust
d5e66348bb NFS: Fix nfs_file_llseek()
After the BKL removal patches were applied to the rest of the NFS code, the
BKL protection in nfs_file_llseek() is no longer sufficient to ensure that
inode->i_size is read safely in generic_file_llseek_unlocked().

In order to fix the situation, we either have to replace the naked read of
inode->i_size in generic_file_llseek_unlocked() with i_size_read(), or the
whole thing needs to be executed under the inode->i_lock.
In order to avoid disrupting other filesystems, avoid touching
generic_file_llseek_unlocked() for now...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-06 20:08:26 -04:00
Jeff Layton
6b37faa175 [CIFS] fix some settings of cifsAttrs after calling SetFileInfo and SetPathInfo
We only need to set them when we call SetFileInfo or SetPathInfo
directly, and as soon as possible after then. We had one place setting
it where it didn't need to be, and another place where it was missing.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-10-06 21:54:41 +00:00
Chuck Lever
2937391385 NLM: Remove unused argument from svc_addsock() function
Clean up: The svc_addsock() function no longer uses its "proto"
argument, so remove it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-10-04 17:12:27 -04:00
Chuck Lever
26a4140923 NLM: Remove "proto" argument from lockd_up()
Clean up: Now that lockd_up() starts listeners for both transports, the
"proto" argument is no longer needed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-10-04 17:12:27 -04:00
Chuck Lever
8c3916f4bd NLM: Always start both UDP and TCP listeners
Commit 24e36663, which first appeared in 2.6.19, changed lockd so that
the client side starts a UDP listener only if there is a UDP NFSv2/v3
mount.  Its description notes:

    This... means that lockd will *not* listen on UDP if the only
    mounts are TCP mount (and nfsd hasn't started).

    The latter is the only one that concerns me at all - I don't know
    if this might be a problem with some servers.

Unfortunately it is a problem for Linux itself.  The rpc.statd daemon
on Linux uses UDP for contacting the local lockd, no matter which
protocol is used for NFS mounts.  Without a local lockd UDP listener,
NFSv2/v3 lock recovery from Linux NFS clients always fails.

Revert parts of commit 24e36663 so lockd_up() always starts both
listeners.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-10-04 17:08:16 -04:00
Josef Bacik
68c9d702bb generic block based fiemap implementation
Any block based fs (this patch includes ext3) just has to declare its own
fiemap() function and then call this generic function with its own
get_block_t. This works well for block based filesystems that will map
multiple contiguous blocks at one time, but will also work for filesystems that
only map one block at a time; you will just end up with an "extent" for each
block. One gotcha is this will not play nicely where there is hole+data
after the EOF. This function will assume it has hit the end of the data as soon
as it hits a hole after the EOF, so if there is any data past that it will
not pick that up. AFAIK no block based fs does this anyway, but it's in the
comments of the function just in case.
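
The behaviour described above can be illustrated with a small userspace
model.  Everything here is hypothetical: map_blocks_fn merely stands in
for the idea of a get_block_t callback, and the layout table is invented
test data.  The walker emits one extent per successful mapping call, so
a mapper that returns runs yields few extents while a one-block mapper
yields one "extent" per block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct extent  { uint64_t logical, physical, len; };   /* all in blocks */
struct mapping { uint64_t physical, len; int mapped; };

/* Hypothetical callback: describe the run (or hole) starting at @lblk. */
typedef void (*map_blocks_fn)(uint64_t lblk, uint64_t max_len,
                              struct mapping *m);

/* Walk [0, nblocks) and record one extent per mapped run. */
size_t walk_extents(map_blocks_fn map, uint64_t nblocks,
                    struct extent *out, size_t max)
{
    assert(map != NULL && out != NULL);

    size_t n = 0;
    uint64_t lblk = 0;

    while (lblk < nblocks && n < max) {
        struct mapping m = { 0, 0, 0 };

        map(lblk, nblocks - lblk, &m);
        if (m.len == 0)
            break;                       /* callback made no progress */
        if (m.mapped)
            out[n++] = (struct extent){ lblk, m.physical, m.len };
        lblk += m.len;                   /* skip the run or the hole */
    }
    return n;
}

/* Invented layout: physical block per logical block, 0 marks a hole. */
static const uint64_t layout[] = { 100, 101, 102, 0, 200, 201, 300 };
#define LAYOUT_BLOCKS (sizeof(layout) / sizeof(layout[0]))

/* Mapper that returns whole physically contiguous runs. */
void map_run(uint64_t lblk, uint64_t max_len, struct mapping *m)
{
    m->len = 1;
    m->mapped = lblk < LAYOUT_BLOCKS && layout[lblk] != 0;
    if (!m->mapped)
        return;
    m->physical = layout[lblk];
    while (m->len < max_len && lblk + m->len < LAYOUT_BLOCKS &&
           layout[lblk + m->len] == m->physical + m->len)
        m->len++;
}

/* Mapper that only ever maps a single block per call. */
void map_single(uint64_t lblk, uint64_t max_len, struct mapping *m)
{
    (void)max_len;
    m->len = 1;
    m->mapped = lblk < LAYOUT_BLOCKS && layout[lblk] != 0;
    m->physical = m->mapped ? layout[lblk] : 0;
}
```

With this layout, walking with map_run produces three extents, whereas
map_single produces six, one for each mapped block.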

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
2008-10-03 17:32:43 -04:00
Mark Fasheh
00dc417fa3 ocfs2: fiemap support
Plug ocfs2 into ->fiemap. Some portions of ocfs2_get_clusters() had to be
refactored so that the extent cache can be skipped in favor of going
directly to the on-disk records. This makes it easier for us to determine
which extent is the last one in the btree. Also, I'm not sure we want to be
caching fiemap lookups anyway as they're not directly related to data
read/write.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: ocfs2-devel@oss.oracle.com
Cc: linux-fsdevel@vger.kernel.org
2008-10-03 17:32:11 -04:00
Mark Fasheh
c4b929b85b vfs: vfs-level fiemap interface
Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
inode operation.

Userspace can get extent information on a file via fiemap ioctl. As input,
the fiemap ioctl takes a struct fiemap which includes an array of struct
fiemap_extent (fm_extents). Size of the extent array is passed as
fm_extent_count and number of extents returned will be written into
fm_mapped_extents. Offset and length fields on the fiemap structure
(fm_start, fm_length) describe a logical range which will be searched for
extents. All extents returned will at least partially contain this range.
The actual extent offsets and ranges returned will be unmodified from their
offset and range on-disk.

The fiemap ioctl returns '0' on success. On error, -1 is returned and errno
is set. If errno is equal to EBADR, then fm_flags will contain those flags
which were passed in which the kernel did not understand. On all other
errors, the contents of fm_extents is undefined.
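
A minimal userspace sketch of calling the interface described above,
assuming a Linux system whose headers provide FS_IOC_FIEMAP in
<linux/fs.h> and the structures in <linux/fiemap.h>.  The function name
dump_extents() and the fixed MAX_EXTENTS limit are choices made for this
example; a file with more extents would need a larger array or repeated
calls.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */

#define MAX_EXTENTS 32      /* arbitrary limit for this example */

/* Print the extents of @path; returns the count, or -1 with errno set. */
int dump_extents(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* struct fiemap is followed in memory by the fm_extents array. */
    size_t sz = sizeof(struct fiemap) +
                MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);
    if (fm == NULL) {
        close(fd);
        errno = ENOMEM;
        return -1;
    }

    fm->fm_start = 0;                     /* search the whole file */
    fm->fm_length = FIEMAP_MAX_OFFSET;
    fm->fm_extent_count = MAX_EXTENTS;    /* capacity of fm_extents */

    int ret = ioctl(fd, FS_IOC_FIEMAP, fm);
    int saved_errno = errno;

    if (ret == 0) {
        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
            const struct fiemap_extent *fe = &fm->fm_extents[i];

            printf("logical=%llu physical=%llu length=%llu\n",
                   (unsigned long long)fe->fe_logical,
                   (unsigned long long)fe->fe_physical,
                   (unsigned long long)fe->fe_length);
        }
        ret = (int)fm->fm_mapped_extents;
    } else {
        ret = -1;
    }

    free(fm);
    close(fd);
    errno = saved_errno;                  /* keep the ioctl's errno */
    return ret;
}
```

Filesystems without a ->fiemap operation reject the ioctl, so callers
should be prepared for a -1 return on those.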

As fiemap evolved, there have been many authors of the vfs patch. As far as
I can tell, the list includes:
Kalpak Shah <kalpak.shah@sun.com>
Andreas Dilger <adilger@sun.com>
Eric Sandeen <sandeen@redhat.com>
Mark Fasheh <mfasheh@suse.com>

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: linux-api@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
2008-10-08 19:44:18 -04:00
Kalpak Shah
4d20c685fa ext4: fix xattr deadlock
ext4_xattr_set_handle() eventually ends up calling
ext4_mark_inode_dirty() which tries to expand the inode by shifting
the EAs.  This leads to the xattr_sem being downed again, resulting
in a deadlock.

This patch makes sure that if ext4_xattr_set_handle() is in the
call-chain, ext4_mark_inode_dirty() will not expand the inode.

Signed-off-by: Kalpak Shah <kalpak.shah@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-08 23:21:54 -04:00
Theodore Ts'o
45a90bfd90 jbd2: Fix buffer head leak when writing the commit block
Also make sure the buffer heads are marked clean before submitting bh
for writing.  The previous code was marking the buffer head dirty,
which would have forced an unneeded write (and seek) to the journal
for no good reason.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-06 12:04:02 -04:00
Theodore Ts'o
ede86cc473 ext4: Add debugging markers that can be used by systemtap
These debugging markers are designed to debug problems such as the
random filesystem latency problems reported by Arjan.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-05 20:50:06 -04:00
Duane Griffin
23f8b79eae jbd2: abort instead of waiting for nonexistent transaction
The __jbd2_log_wait_for_space function sits in a loop checkpointing
transactions until there is sufficient space free in the journal. 
However, if there are no transactions to be processed (e.g.  because the
free space calculation is wrong due to a corrupted filesystem) it will
never progress.

Check for space being required when no transactions are outstanding and
abort the journal instead of endlessly looping.

This patch fixes the bug reported by Sami Liedes at:
http://bugzilla.kernel.org/show_bug.cgi?id=10976

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: Sami Liedes <sliedes@cc.hut.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-08 23:28:31 -04:00
Frederic Bohe
c806e68f56 ext4: fix initialization of UNINIT bitmap blocks
This fixes a bug which caused on-line resizing of filesystems with a
1k blocksize to fail.  The root cause of this bug was the fact that if
an uninitialized bitmap block gets read in by userspace (which
e2fsprogs does try to avoid, but can happen when the blocksize is less
than the pagesize and an adjacent block is read into memory)
ext4_read_block_bitmap() was erroneously depending on the buffer
uptodate flag to decide whether it needed to initialize the bitmap
block in memory --- i.e., to set the standard set of blocks in use by
a block group (superblock, bitmaps, inode table, etc.).  Essentially,
ext4_read_block_bitmap() assumed it was the only routine that might
try to read a block containing a block bitmap, which is simply not
true.  

To fix this, ext4_read_block_bitmap() and ext4_read_inode_bitmap()
must always initialize uninitialized bitmap blocks.  Once a block or
inode is allocated out of that bitmap, it will be marked as
initialized in the block group descriptor, so in general this won't
result in any extra unnecessary work.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 08:09:18 -04:00