Commit Graph

16084 Commits

Author SHA1 Message Date
Lachlan McIlroy
def6b3ba56 xfs_file_last_byte() needs to acquire ilock
We had some systems crash with this stack:

[<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
[<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
[<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
[<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
[<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
[<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
[<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
[<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
[<a000000100162ea0>] __fput+0x1a0/0x420
[<a000000100163160>] fput+0x40/0x60

The problem here is that xfs_file_last_byte() does not acquire the
inode lock and can therefore race with another thread that is modifying
the extext list.  While xfs_bmap_last_offset() is trying to lookup
what was the last extent some extents were merged and the extent list
shrunk so the index we lookup is now beyond the end of the extent list
and potentially in a freed buffer.

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-30 00:25:25 -05:00
J. Bruce Fields
e300a63ce4 nfsd4: replace callback thread by asynchronous rpc
We don't really need a synchronous rpc, and moving to an asynchronous
rpc allows us to do without this extra kthread.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 17:10:53 -04:00
J. Bruce Fields
3cef9ab266 nfsd4: lookup up callback cred only once
Lookup the callback cred once and then use it for all subsequent
callbacks.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 16:45:03 -04:00
J. Bruce Fields
ecdd03b791 nfsd4: create rpc callback client from server thread
The code is a little simpler, and it should be easier to avoid races, if
we just do all rpc client creation/destruction from nfsd or laundromat
threads and do only the rpc calls themselves asynchronously.  The rpc
creation doesn't involve any significant waiting (it doesn't call the
client, for example), so there's no reason not to do this.

Also don't bother destroying the client on failure of the rpc null
probe.  We may want to retry the probe later anyway.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 16:44:53 -04:00
J. Bruce Fields
e1cab5a589 nfsd4: set cb_client inside setup_callback_client
This is just a minor code simplification.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 16:44:47 -04:00
J. Bruce Fields
595947acaa nfsd4: set shorter timeout
We tried to do something overly complicated with the callback rpc
timeouts here.  And they're wrong--the result is that by the time a
single callback times out, it's already too late to tell the client
(using the cb_path_down return to RENEW) that the callback is down.

Use a much shorter, simpler timeout.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 16:44:40 -04:00
J. Bruce Fields
f64f79ea5f nfsd4: setclientid_confirm callback-change fixes
This setclientid_confirm case should allow the client to change
callbacks, but it currently has a dummy implementation that just turns
off callbacks completely.  That dummy implementation isn't completely
correct either, though:

	- There's no need to remove any client recovery directory in
	  this case.
	- New clientid confirm verifiers should be generated (and
	  returned) in setclientid; there's no need to generate a new
	  one here.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 16:44:34 -04:00
Christoph Hellwig
6321e3ed2a xfs: fix getbmap vs mmap deadlock
xfs_getbmap (or rather the formatters called by it) copy out the getbmap
structures under the ilock, which can deadlock against mmap.  This has
been reported via bugzilla a while ago (#717) and has recently also
shown up via lockdep.

So allocate a temporary buffer to format the kernel getbmap structures
into and then copy them out after dropping the locks.

A little problem with this is that we limit the number of extents we
can copy out by the maximum allocation size, but I see no real way
around that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 13:25:29 -05:00
Tao Ma
7e31a966ad ocfs2/trivial: Remove unused variable in ocfs2_rename.
With indexed dir enabled, now we use ocfs2_dir_lookup_result to
wrap all the bh used for dir. So remove the 2 unused variables.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-04-29 10:57:18 -07:00
J. Bruce Fields
b8fd47aefa nfsd: quiet compile warning
Stephen Rothwell said:
"Today's linux-next build (powerpc ppc64_defconfig) produced this new
warning:

fs/nfsd/nfs4state.c: In function 'EXPIRED_STATEID':
fs/nfsd/nfs4state.c:2757: warning: comparison of distinct pointer types lacks a cast

Caused by commit 78155ed75f ("nfsd4:
distinguish expired from stale stateids")."

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
2009-04-29 11:36:17 -04:00
J. Bruce Fields
c654b8a9cb nfsd: support ext4 i_version
ext4 supports a real NFSv4 change attribute, which is bumped whenever
the ctime would be updated, including times when two updates arrive
within a jiffy of each other.  (Note that although ext4 has space for
nanosecond-precision ctime, the real resolution is lower: it actually
uses jiffies as the time-source.)  This ensures clients will invalidate
their caches when they need to.

There is some fear that keeping the i_version up-to-date could have
performance drawbacks, so for now it's turned on only by a mount option.
We hope to do something better eventually.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Theodore Tso <tytso@mit.edu>
2009-04-29 11:35:49 -04:00
J. Bruce Fields
3352d2c2d0 nfsd4: delete obsolete xdr comments
We don't need comments to tell us these macros are ugly.  And we're long
past trying to share any of this code with the BSD's.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 11:35:49 -04:00
J. Bruce Fields
bc749ca4c4 nfsd: eliminate ENCODE_HEAD macro
This macro doesn't serve any useful purpose.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 11:35:49 -04:00
Christoph Hellwig
4be4a00fb5 xfs: a couple getbmap cleanups
- reshuffle various conditionals for data vs attr fork to make the code
   more readable
 - do fine-grainded goto-based error handling
 - exit early from conditionals instead of keeping a long else branch around
 - allow kmem_alloc to fail

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 10:00:01 -05:00
Olaf Weber
2ac00af7a6 xfs: add more checks to superblock validation
There had been reports where xfs filesystem was randomly
corrupted with fsfuzzer, and xfs failed to handle it
gracefully. This patch fixes couple of reported problem
by providing additional checks in the superblock
validation routine.

Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 09:24:29 -05:00
Lachlan McIlroy
f25181f598 xfs_file_last_byte() needs to acquire ilock
We had some systems crash with this stack:

[<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
[<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
[<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
[<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
[<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
[<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
[<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
[<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
[<a000000100162ea0>] __fput+0x1a0/0x420
[<a000000100163160>] fput+0x40/0x60

The problem here is that xfs_file_last_byte() does not acquire the
inode lock and can therefore race with another thread that is modifying
the extext list.  While xfs_bmap_last_offset() is trying to lookup
what was the last extent some extents were merged and the extent list
shrunk so the index we lookup is now beyond the end of the extent list
and potentially in a freed buffer.

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 09:14:10 -05:00
Ingo Molnar
e7fd5d4b3d Merge branch 'linus' into perfcounters/core
Merge reason: This brach was on -rc1, refresh it to almost-rc4 to pick up
              the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:47:05 +02:00
FUJITA Tomonori
69838727bc bio: fix memcpy corruption in bio_copy_user_iov()
st driver uses blk_rq_map_user() in order to just build a request out
of page frames. In this case, map_data->offset is a non zero value and
iov[0].iov_base is NULL. We need to increase nr_pages for that.

Cc: stable@kernel.org
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28 20:24:29 +02:00
Chuck Lever
e06b64050e NFSD: Stricter buffer size checking in fs/nfsd/nfsctl.c
Clean up: For consistency, handle output buffer size checking in a
other nfsctl functions the same way it's done for write_versions().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:30 -04:00
Chuck Lever
261758b5c3 NFSD: Stricter buffer size checking in write_versions()
While it's not likely today that there are enough NFS versions to
overflow the output buffer in write_versions(), we should be more
careful about detecting the end of the buffer.

The number of NFS versions will only increase as NFSv4 minor versions
are added.

Note that this API doesn't behave the same as portlist.  Here we
attempt to display as many versions as will fit in the buffer, and do
not provide any indication that an overflow would have occurred.  I
don't have any good rationale for that.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:30 -04:00
Chuck Lever
3d72ab8fdd NFSD: Stricter buffer size checking in write_recoverydir()
While it's not likely a pathname will be longer than
SIMPLE_TRANSACTION_SIZE, we should be more careful about just
plopping it into the output buffer without bounds checking.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:30 -04:00
Chuck Lever
8435d34dbb SUNRPC: pass buffer size to svc_sock_names()
Adjust the synopsis of svc_sock_names() to pass in the size of the
output buffer.  Add a documenting comment.

This is a cosmetic change for now.  A subsequent patch will make sure
the buffer length is passed to one_sock_name(), where the length will
actually be useful.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:28 -04:00
Chuck Lever
bfba9ab4c6 SUNRPC: pass buffer size to svc_addsock()
Adjust the synopsis of svc_addsock() to pass in the size of the output
buffer.  Add a documenting comment.

This is a cosmetic change for now.  A subsequent patch will make sure
the buffer length is passed to one_sock_name(), where the length will
actually be useful.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:28 -04:00
Chuck Lever
335c54bdc4 NFSD: Prevent a buffer overflow in svc_xprt_names()
The svc_xprt_names() function can overflow its buffer if it's so near
the end of the passed in buffer that the "name too long" string still
doesn't fit.  Of course, it could never tell if it was near the end
of the passed in buffer, since its only caller passes in zero as the
buffer length.

Let's make this API a little safer.

Change svc_xprt_names() so it *always* checks for a buffer overflow,
and change its only caller to pass in the correct buffer length.

If svc_xprt_names() does overflow its buffer, it now fails with an
ENAMETOOLONG errno, instead of trying to write a message at the end
of the buffer.  I don't like this much, but I can't figure out a clean
way that's always safe to return some of the names, *and* an
indication that the buffer was not long enough.

The displayed error when doing a 'cat /proc/fs/nfsd/portlist' is
"File name too long".

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:28 -04:00
Chuck Lever
ea068bad27 NFSD: move lockd_up() before svc_addsock()
Clean up.

A couple of years ago, a series of commits, finishing with commit
5680c446, swapped the order of the lockd_up() and svc_addsock() calls
in __write_ports().  At that time lockd_up() needed to know the
transport protocol of the passed-in socket to start a listener on the
same transport protocol.

These days, lockd_up() doesn't take a protocol argument; it always
starts both a UDP and TCP listener.  It's now more straightforward to
try the lockd_up() first, then do a lockd_down() if the svc_addsock()
fails.

Careful review of this code shows that the svc_sock_names() call is
used only to close the just-opened socket in case lockd_up() fails.
So it is no longer needed if lockd_up() is done first.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:28 -04:00
Chuck Lever
0a5372d8a1 NFSD: Finish refactoring __write_ports()
Clean up: Refactor transport name listing out of __write_ports() to
make it easier to understand and maintain.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:27 -04:00
Chuck Lever
c71206a7b4 NFSD: Note an additional requirement when passing TCP sockets to portlist
User space must call listen(3) on SOCK_STREAM sockets passed into
/proc/fs/nfsd/portlist, otherwise that listener is ignored.  Document
this.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:27 -04:00
Chuck Lever
0b7c2f6fc7 NFSD: Refactor socket creation out of __write_ports()
Clean up: Refactor the socket creation logic out of __write_ports() to
make it easier to understand and maintain.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:27 -04:00
Chuck Lever
82d565919a NFSD: Refactor portlist socket closing into a helper
Clean up: Refactor the socket closing logic out of __write_ports() to
make it easier to understand and maintain.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:26 -04:00
Chuck Lever
4eb68c266c NFSD: Refactor transport addition out of __write_ports()
Clean up: Refactor transport addition out of __write_ports() to make
it easier to understand and maintain.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:26 -04:00
Chuck Lever
4cd5dc751a NFSD: Refactor transport removal out of __write_ports()
Clean up: Refactor transport removal out of __write_ports() to make it
easier to understand and maintain.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:26 -04:00
Tejun Heo
08cbf542bf fuse: export symbols to be used by CUSE
Export the following symbols for CUSE.

fuse_conn_put()
fuse_conn_get()
fuse_conn_kill()
fuse_send_init()
fuse_do_open()
fuse_sync_release()
fuse_direct_io()
fuse_do_ioctl()
fuse_file_poll()
fuse_request_alloc()
fuse_get_req()
fuse_put_request()
fuse_request_send()
fuse_abort_conn()
fuse_dev_release()
fuse_dev_operations

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:42 +02:00
Tejun Heo
a325f9b922 fuse: update fuse_conn_init() and separate out fuse_conn_kill()
Update fuse_conn_init() such that it doesn't take @sb and move bdi
registration into a separate function.  Also separate out
fuse_conn_kill() from fuse_put_super().

These will be used to implement cuse.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:41 +02:00
Miklos Szeredi
797759aaf3 fuse: don't use inode in fuse_file_poll
Use ff->fc and ff->nodeid instead of file->f_dentry->d_inode in the
fuse_file_poll() implementation.

This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:41 +02:00
Miklos Szeredi
d36f248710 fuse: don't use inode in fuse_do_ioctl() helper
Create a helper for sending an IOCTL request that doesn't use a struct
inode.

This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:39 +02:00
Miklos Szeredi
8b0797a498 fuse: don't use inode in fuse_sync_release()
Make fuse_sync_release() a generic helper function that doesn't need a
struct inode pointer.  This makes it suitable for use by CUSE.

Change return value of fuse_release_common() from int to void.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:39 +02:00
Miklos Szeredi
91fe96b403 fuse: create fuse_do_open() helper for CUSE
Create a helper for sending an OPEN request that doesn't need a struct
inode pointer.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:37 +02:00
Miklos Szeredi
c7b7143c63 fuse: clean up args in fuse_finish_open() and fuse_release_fill()
Move setting ff->fh, ff->nodeid and file->private_data outside
fuse_finish_open().  Add ->open_flags member to struct fuse_file.

This simplifies the argument passing to fuse_finish_open() and
fuse_release_fill(), and paves the way for creating an open helper
that doesn't need an inode pointer.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:37 +02:00
Miklos Szeredi
2106cb1893 fuse: don't use inode in helpers called by fuse_direct_io()
Use ff->fc and ff->nodeid instead of passing down the inode.

This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:37 +02:00
Miklos Szeredi
da5e471457 fuse: add members to struct fuse_file
Add new members ->fc and ->nodeid to struct fuse_file.  This will aid
in converting functions for use by CUSE, where the inode is not owned
by a fuse filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:36 +02:00
Miklos Szeredi
d09cb9d7f6 fuse: prepare fuse_direct_io() for CUSE
Move code operating on the inode out from fuse_direct_io().

This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:36 +02:00
Miklos Szeredi
2d698b0703 fuse: clean up fuse_write_fill()
Move out code from fuse_write_fill() which is not common to all
callers.  Remove two function arguments which become unnecessary.

Also remove unnecessary memset(), the request is already initialized
to zero.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:36 +02:00
Miklos Szeredi
b0be46ebf7 fuse: use struct path in release structure
Use struct path instead of separate dentry and vfsmount in
req->misc.release.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:36 +02:00
Miklos Szeredi
fd9db72977 fuse: destroy bdi on error
Destroy bdi on error in fuse_fill_super().

This was an omission from commit 26c3679101
"fuse: destroy bdi on umount", which moved the bdi_destroy() call from
fuse_conn_put() to fuse_put_super().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-04-28 16:56:35 +02:00
Tejun Heo
6b2db28a7a fuse: misc cleanups
* fuse_file_alloc() was structured in weird way.  The success path was
  split between else block and code following the block.  Restructure
  the code such that it's easier to read and modify.

* Unindent success path of fuse_release_common() to ease future
  changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:35 +02:00
Jeff Moyer
db2dbb12dc block: implement blkdev_readpages
Doing a proper block dev ->readpages() speeds up the crazy dump(8)
approach of using interleaved process IO.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28 07:37:33 +02:00
Theodore Ts'o
8b0f9e8f78 ext4: avoid unnecessary spinlock in critical POSIX ACL path
If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem
to do POSIX ACL checks on any files not owned by the caller, and it does
this for every single pathname component that it looks up.

That obviously can be pretty expensive if the filesystem isn't careful
about it, especially with locking. That's doubly sad, since the common
case tends to be that there are no ACL's associated with the files in
question.

ext4 already caches the ACL data so that it doesn't have to look it up
over and over again, but it does so by taking the inode->i_lock spinlock
on every lookup. Which is a noticeable overhead even if it's a private
lock, especially on CPU's where the serialization is expensive (eg Intel
Netburst aka 'P4').

For the special case of not actually having any ACL's, all that locking is
unnecessary. Even if somebody else were to be changing the ACL's on
another CPU, we simply don't care - if we've seen a NULL ACL, we might as
well use it.

So just load the ACL speculatively without any locking, and if it was
NULL, just use it. If it's non-NULL (either because we had a cached
entry, or because the cache hasn't been filled in at all), it means that
we'll need to get the lock and re-load it properly.

(This commit was ported from a patch originally authored by Linus for
ext3.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-27 17:33:23 -04:00
Linus Torvalds
96159f2511 ext3: avoid unnecessary spinlock in critical POSIX ACL path
If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem 
to do POSIX ACL checks on any files not owned by the caller, and it does 
this for every single pathname component that it looks up.

That obviously can be pretty expensive if the filesystem isn't careful 
about it, especially with locking. That's doubly sad, since the common 
case tends to be that there are no ACL's associated with the files in 
question.

ext3 already caches the ACL data so that it doesn't have to look it up 
over and over again, but it does so by taking the inode->i_lock spinlock 
on every lookup. Which is a noticeable overhead even if it's a private 
lock, especially on CPU's where the serialization is expensive (eg Intel 
Netburst aka 'P4').

For the special case of not actually having any ACL's, all that locking is 
unnecessary. Even if somebody else were to be changing the ACL's on 
another CPU, we simply don't care - if we've seen a NULL ACL, we might as 
well use it.

So just load the ACL speculatively without any locking, and if it was 
NULL, just use it. If it's non-NULL (either because we had a cached 
entry, or because the cache hasn't been filled in at all), it means that 
we'll need to get the lock and re-load it properly.

This is noticeable even on Nehalem, which does locking quite well (much 
better than P4). From lmbench:

	Processor, Processes - times in microseconds - smaller is better
	--------------------------------------------------------------------
	Host                 OS  Mhz null null      open slct fork exec sh  
	                             call  I/O stat clos TCP  proc proc proc
	--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ----
 - before:
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.45 2.18 69.1 273. 1141
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.48 2.28 69.9 253. 1140
	nehalem.l Linux 2.6.30- 3193 0.04 0.10 0.95 1.42 2.19 68.6 284. 1141
 - after:
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.44 2.12 68.3 282. 1094
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.20 67.0 308. 1123
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.36 67.4 293. 1148

where you can see what appears to be a roughly 3% improvement in stat
and open/close latencies from just the removal of the locking overhead. 

Of course, this only matters for files you don't own (the owner never 
needs to do the ACL checks), but that's the common case for libraries, 
header files, and executables. As well as for the base components of any 
absolute pathname, even if you are the owner of the final file.

[ At some point we probably want to move this ACL caching logic entirely
  into the VFS layer (and only call down to the filesystem when
  uncached), but in the meantime this improves ext3 a bit.

  A similar fix to btrfs makes a much bigger difference (15x improvement
  in lmbench) due to broken caching. ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2009-04-27 17:24:22 -04:00
Theodore Ts'o
9bffad1ed2 ext4: convert instrumentation from markers to tracepoints
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 11:48:11 -04:00
Theodore Ts'o
879c5e6b7c jbd2: convert instrumentation from markers to tracepoints
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 11:47:48 -04:00