Commit Graph

47524 Commits

Author SHA1 Message Date
Mimi Zohar
39eeb4fb97 security: define kernel_read_file hook
The kernel_read_file security hook is called prior to reading the file
into memory.

Changelog v4+:
- export security_kernel_read_file()

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
2016-02-21 09:06:09 -05:00
Mimi Zohar
09596b94f7 vfs: define kernel_read_file_from_path
This patch defines kernel_read_file_from_path(), a wrapper for the VFS
common kernel_read_file().

Changelog:
- revert error msg regression - reported by Sergey Senozhatsky
- Separated from the IMA patch

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2016-02-21 08:55:00 -05:00
Linus Torvalds
0389075ecf Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "This is unusually large, partly due to the EFI fixes that prevent
  accidental deletion of EFI variables through efivarfs that may brick
  machines.  These fixes are somewhat involved to maintain compatibility
  with existing install methods and other usage modes, while trying to
  turn off the 'rm -rf' bricking vector.

  Other fixes are for large page ioremap()s and for non-temporal
  user-memcpy()s"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Fix vmalloc_fault() to handle large pages properly
  hpet: Drop stale URLs
  x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache()
  x86/uaccess/64: Make the __copy_user_nocache() assembly code more readable
  lib/ucs2_string: Correct ucs2 -> utf8 conversion
  efi: Add pstore variables to the deletion whitelist
  efi: Make efivarfs entries immutable by default
  efi: Make our variable validation list include the guid
  efi: Do variable name validation tests in utf8
  efi: Use ucs2_as_utf8 in efivarfs instead of open coding a bad version
  lib/ucs2_string: Add ucs2 -> utf8 helper functions
2016-02-20 09:32:40 -08:00
Maxim Patlasov
7ae8fd0351 fs/pnode.c: treat zero mnt_group_id-s as unequal
propagate_one(m) calculates "type" argument for copy_tree() like this:

>    if (m->mnt_group_id == last_dest->mnt_group_id) {
>        type = CL_MAKE_SHARED;
>    } else {
>        type = CL_SLAVE;
>        if (IS_MNT_SHARED(m))
>           type |= CL_MAKE_SHARED;
>   }

The "type" argument then governs clone_mnt() behavior with respect to flags
and mnt_master of new mount. When we iterate through a slave group, it is
possible that both current "m" and "last_dest" are not shared (although,
both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
above erroneously makes new mount shared and sets its mnt_master to
last_source->mnt_master. The patch fixes the problem by handling zero
mnt_group_id-s as though they are unequal.

The similar problem exists in the implementation of "else" clause above
when we have to ascend upward in the master/slave tree by calling:

>    last_source = last_source->mnt_master;
>    last_dest = last_source->mnt_parent;

proper number of times. The last step is governed by
"n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if
both are zero. The patch fixes this case in the same way as the former one.

[AV: don't open-code an obvious helper...]

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20 00:15:52 -05:00
Al Viro
0bacbe528e affs_do_readpage_ofs(): just use kmap_atomic() around memcpy()
It forgets kunmap() on a failure exit, but there's really no point keeping
the page kmapped at all - after all, what we are doing is a bunch of memcpy()
into the parts of page, so kmap_atomic()/kunmap_atomic() just around those
memcpy() is enough.

Spotted-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20 00:15:51 -05:00
Mateusz Guzik
0e9a7da51b xattr handlers: plug a lock leak in simple_xattr_list
The code could leak xattrs->lock on error.

Problem introduced with 786534b92f "tmpfs: listxattr should
include POSIX ACL xattrs".

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20 00:15:51 -05:00
Wouter van Kesteren
2feb55f890 fs: allow no_seek_end_llseek to actually seek
The user-visible impact of the issue is for example that without this
patch sensors-detect breaks when trying to seek in /dev/cpu/0/cpuid.

'~0ULL' is a 'unsigned long long' that when converted to a loff_t,
which is signed, gets turned into -1. later in vfs_setpos we have
'if (offset > maxsize)', which makes it always return EINVAL.

Fixes: b25472f9b9 ("new helpers: no_seek_end_llseek{,_size}()")
Signed-off-by: Wouter van Kesteren <woutershep@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20 00:15:50 -05:00
Linus Torvalds
020ecbba05 Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
 "Miscellaneous ext4 bug fixes for v4.5"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix crashes in dioread_nolock mode
  ext4: fix bh->b_state corruption
  ext4: fix memleak in ext4_readdir()
  ext4: remove unused parameter "newblock" in convert_initialized_extent()
  ext4: don't read blocks from disk after extents being swapped
  ext4: fix potential integer overflow
  ext4: add a line break for proc mb_groups display
  ext4: ioctl: fix erroneous return value
  ext4: fix scheduling in atomic on group checksum failure
  ext4 crypto: move context consistency check to ext4_file_open()
  ext4 crypto: revalidate dentry after adding or removing the key
2016-02-19 13:44:12 -08:00
Linus Torvalds
ce6b71432d Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "My for-linus-4.5 branch has a btrfs DIO error passing fix.

  I know how much you love DIO, so I'm going to suggest against reading
  it.  We'll follow up with a patch to drop the error arg from
  dio_end_io in the next merge window."

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix direct IO requests not reporting IO error to user space
2016-02-19 13:40:42 -08:00
Linus Torvalds
87d9ac712b Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: slab: free kmem_cache_node after destroy sysfs file
  ipc/shm: handle removed segments gracefully in shm_mmap()
  MAINTAINERS: update Kselftest Framework mailing list
  devm_memremap_release(): fix memremap'd addr handling
  mm/hugetlb.c: fix incorrect proc nr_hugepages value
  mm, x86: fix pte_page() crash in gup_pte_range()
  fsnotify: turn fsnotify reaper thread into a workqueue job
  Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"
  mm: fix regression in remap_file_pages() emulation
  thp, dax: do not try to withdraw pgtable from non-anon VMA
2016-02-19 13:36:00 -08:00
Al Viro
c1223ca48b orangefs: get rid of op refcounts
not needed anymore

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:56 -05:00
Al Viro
05a50a5be8 orangefs: have ..._clean_interrupted_...() wait for copy to/from daemon
* turn all those list_del(&op->list) into list_del_init()
* don't pick ops that are already given up in control device
  ->read()/->write_iter().
* have orangefs_clean_interrupted_operation() notice if op is currently
  being copied to/from daemon (by said ->read()/->write_iter()) and
  wait for that to finish.
* when we are done copying to/from daemon and find that it had been
  given up while we were doing that, wake the waiting ..._clean_interrupted_...

As the result, we are guaranteed that orangefs_clean_interrupted_operation(op)
doesn't return until nobody else can see op.  Moreover, we don't need to play
with op refcounts anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:56 -05:00
Al Viro
5964c1b839 orangefs: set correct ->downcall.status on failing to copy reply from daemon
... and clean the end of control device ->write_iter() while we are at it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:55 -05:00
Mike Marshall
ddb84da38d Orangefs: remove vestigial ASYNC code
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:55 -05:00
Mike Marshall
5253487e04 Orangefs: make some gossip statements more helpful.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:55 -05:00
Al Viro
897c5df6cf orangefs: get rid of op->done
shouldn't be needed now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:55 -05:00
Al Viro
82d37f19ff orangefs_readdir_index_put(): get rid of bufmap argument
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:54 -05:00
Al Viro
ea2c9c9f65 orangefs: bufmap rewrite
new waiting-for-slot logics:
	* make request for slot wait for bufmap to be set up if it
comes before it's installed *OR* while it's running down
	* make closing control device wait for all slots to be freed
	* waiting itself rewritten to (open-coded) analogues of wait_event_...
primitives - we would need wait_event_locked() and, pardon an obscenely
long name, wait_event_interruptible_exclusive_timeout_locked().
	* we never wait for more than slot_timeout_secs in total and,
if during the wait the daemon goes away, we only allow
ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS for it to come back.
	* (cosmetical) bitmap is used instead of an array of zeroes and ones
	* old (and only reached if we are about to corrupt memory) waiting
for daemon restart in service_operation() removed.

[Martin's fixes folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:54 -05:00
Al Viro
178041848a orangefs_bufmap_..._query(): don't bother with refcounts
... just hold the spinlock while fetching the field in question.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:54 -05:00
Al Viro
05b39a8b5c orangefs: lift handling of timeouts and attempts count to service_operation()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:54 -05:00
Al Viro
c72f15b7d9 service_operation(): don't block signals, just use ..._killable
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:53 -05:00
Al Viro
98815ade9e orangefs: sanitize handling of request list
* checking that daemon is running (to decide whether we want to limit
the timeout) should be done *after* the damn thing is included into
the list; doing that before means that if the daemon gets shut down
in between, we'll end up waiting indefinitely (== up to kill -9).

* cancels should go into the head of the queue - the sooner they
are picked, the less work daemon has to do and the sooner we get to
free the slot held by aborted operation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:53 -05:00
Al Viro
d2d87a3b6d orangefs: get rid of loop in wait_for_matching_downcall()
turn op->waitq into struct completion...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:53 -05:00
Martin Brandenburg
cf22644a0e orangefs: use S_ISREG(mode) and friends instead of mode & S_IFREG.
Suggestion from Dan Carpenter.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:53 -05:00
Al Viro
78699e29fd orangefs: delay freeing slot until cancel completes
Make cancels reuse the aborted read/write op, to make sure they do not
fail on lack of memory.

Don't issue a cancel unless the daemon has seen our read/write, has not
replied and isn't being shut down.

If cancel *is* issued, don't wait for it to complete; stash the slot
in there and just have it freed when cancel is finally replied to or
purged (and delay dropping the reference until then, obviously).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-02-19 13:45:53 -05:00
Eric Sandeen
6332b9b5e7 ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodes
We forgot to set .get_nextdqblk operation in quotactl_ops structure used
by ext4 when quota is using hidden inode thus the operation was not
really supported. Fix the omission.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-19 19:28:07 +01:00
Jan Kara
74dae42785 ext4: fix crashes in dioread_nolock mode
Competing overwrite DIO in dioread_nolock mode will just overwrite
pointer to io_end in the inode. This may result in data corruption or
extent conversion happening from IO completion interrupt because we
don't properly set buffer_defer_completion() when unlocked DIO races
with locked DIO to unwritten extent.

Since unlocked DIO doesn't need io_end for anything, just avoid
allocating it and corrupting pointer from inode for locked DIO.
A cleaner fix would be to avoid these games with io_end pointer from the
inode but that requires more intrusive changes so we leave that for
later.

Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-19 00:33:21 -05:00
Jan Kara
ed8ad83808 ext4: fix bh->b_state corruption
ext4 can update bh->b_state non-atomically in _ext4_get_block() and
ext4_da_get_block_prep(). Usually this is fine since bh is just a
temporary storage for mapping information on stack but in some cases it
can be fully living bh attached to a page. In such case non-atomic
update of bh->b_state can race with an atomic update which then gets
lost. Usually when we are mapping bh and thus updating bh->b_state
non-atomically, nobody else touches the bh and so things work out fine
but there is one case to especially worry about: ext4_finish_bio() uses
BH_Uptodate_Lock on the first bh in the page to synchronize handling of
PageWriteback state. So when blocksize < pagesize, we can be atomically
modifying bh->b_state of a buffer that actually isn't under IO and thus
can race e.g. with delalloc trying to map that buffer. The result is
that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.

Fix the problem by always updating bh->b_state bits atomically.

CC: stable@vger.kernel.org
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-19 00:18:25 -05:00
Jeff Layton
0918f1c309 fsnotify: turn fsnotify reaper thread into a workqueue job
We don't require a dedicated thread for fsnotify cleanup.  Switch it
over to a workqueue job instead that runs on the system_unbound_wq.

In the interest of not thrashing the queued job too often when there are
a lot of marks being removed, we delay the reaper job slightly when
queueing it, to allow several to gather on the list.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 16:23:24 -08:00
Jeff Layton
13d34ac6e5 Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"
This reverts commit c510eff6be ("fsnotify: destroy marks with
call_srcu instead of dedicated thread").

Eryu reported that he was seeing some OOM kills kick in when running a
testcase that adds and removes inotify marks on a file in a tight loop.

The above commit changed the code to use call_srcu to clean up the
marks.  While that does (in principle) work, the srcu callback job is
limited to cleaning up entries in small batches and only once per jiffy.
It's easily possible to overwhelm that machinery with too many call_srcu
callbacks, and Eryu's reproduer did just that.

There's also another potential problem with using call_srcu here.  While
you can obviously sleep while holding the srcu_read_lock, the callbacks
run under local_bh_disable, so you can't sleep there.

It's possible when putting the last reference to the fsnotify_mark that
we'll end up putting a chain of references including the fsnotify_group,
uid, and associated keys.  While I don't see any obvious ways that that
could occurs, it's probably still best to avoid using call_srcu here
after all.

This patch reverts the above patch.  A later patch will take a different
approach to eliminated the dedicated thread here.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Cc: Jan Kara <jack@suse.com>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 16:23:24 -08:00
Mimi Zohar
bc8ca5b92d vfs: define kernel_read_file_id enumeration
To differentiate between the kernel_read_file() callers, this patch
defines a new enumeration named kernel_read_file_id and includes the
caller identifier as an argument.

Subsequent patches define READING_KEXEC_IMAGE, READING_KEXEC_INITRAMFS,
READING_FIRMWARE, READING_MODULE, and READING_POLICY.

Changelog v3:
- Replace the IMA specific enumeration with a generic one.

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2016-02-18 17:14:04 -05:00
Mimi Zohar
b44a7dfc6f vfs: define a generic function to read a file from the kernel
For a while it was looked down upon to directly read files from Linux.
These days there exists a few mechanisms in the kernel that do just
this though to load a file into a local buffer.  There are minor but
important checks differences on each.  This patch set is the first
attempt at resolving some of these differences.

This patch introduces a common function for reading files from the kernel
with the corresponding security post-read hook and function.

Changelog v4+:
- export security_kernel_post_read_file() - Fengguang Wu
v3:
- additional bounds checking - Luis
v2:
- To simplify patch review, re-ordered patches

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Reviewed-by: Luis R. Rodriguez <mcgrof@suse.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2016-02-18 17:14:03 -05:00
Dave Hansen
c1192f8428 x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
The protection key can now be just as important as read/write
permissions on a VMA.  We need some debug mechanism to help
figure out if it is in play.  smaps seems like a logical
place to expose it.

arch/x86/kernel/setup.c is a bit of a weirdo place to put
this code, but it already had seq_file.h and there was not
a much better existing place to put it.

We also use no #ifdef.  If protection keys is .config'd out we
will effectively get the same function as if we used the weak
generic function.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Mark Williamson <mwilliamson@undo-software.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18 19:46:29 +01:00
Jan Kara
ccf370e43e quota: Forbid Q_GETQUOTA and Q_GETNEXTQUOTA for frozen filesystem
Commit 7955118eaf (quota: Allow Q_GETQUOTA for frozen filesystem)
allowed Q_GETQUOTA call for frozen filesystem. It makes sense on the
first look but zero-day testing has shown that with this change ext4
warns about starting a transaction for frozen filesystem. This happens
because ext4_acquire_dquot() prepares for allocating space for new quota
structure. Although it would be possible to implement Q_GETQUOTA for
ext4 without allocating space for non-existent structures, the matter
further complicates because OCFS2 needs to update on-disk structure use
count when a new cluster node loads quota information from disk. So just
revert the change and forbid Q_GETQUOTA together with Q_GETNEXTQUOTA for
frozen filesystem. Add comment to quotactl_cmd_write() to save us from
repeating this excercise in a few years when I forget again.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-18 14:03:03 +01:00
Jan Kara
044c9b6753 quota: Fix possible races during quota loading
When loading new quota structure from disk, there is a possibility caller
of dqget() will see uninitialized data due to CPU reordering loads or
stores - loads from dquot can be reordered before test of DQ_ACTIVE_B
bit or setting of this bit could be reordered before filling of the
structure. Fix the issue by adding proper memory barriers.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-18 13:34:41 +01:00
Kinglong Mee
aa66b0bb08 btrfs: fix memory leak of fs_info in block group cache
When starting up linux with btrfs filesystem, I got many memory leak
messages by kmemleak as,

unreferenced object 0xffff880066882000 (size 4096):
  comm "modprobe", pid 730, jiffies 4294690024 (age 196.599s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811d09aa>] kmem_cache_alloc_trace+0xea/0x1e0
    [<ffffffffa03620fb>] btrfs_alloc_dummy_fs_info+0x6b/0x2a0 [btrfs]
    [<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs]
    [<ffffffffa0360aa9>] btrfs_test_free_space_cache+0x39/0xed0 [btrfs]
    [<ffffffffa03b5a74>] trace_raw_output_xfs_attr_class+0x54/0xe0 [xfs]
    [<ffffffff81002122>] do_one_initcall+0xb2/0x1f0
    [<ffffffff811765aa>] do_init_module+0x5e/0x1e9
    [<ffffffff810fec09>] load_module+0x20a9/0x2690
    [<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0
    [<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800573f8000 (size 10256):
  comm "modprobe", pid 730, jiffies 4294690185 (age 196.460s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8174d52e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff8119ca6e>] kmalloc_order+0x5e/0x70
    [<ffffffff8119caa4>] kmalloc_order_trace+0x24/0x90
    [<ffffffffa03620b3>] btrfs_alloc_dummy_fs_info+0x23/0x2a0 [btrfs]
    [<ffffffffa03624fc>] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs]
    [<ffffffffa036603d>] run_test+0xfd/0x320 [btrfs]
    [<ffffffffa0366f34>] btrfs_test_free_space_tree+0x94/0xee [btrfs]
    [<ffffffffa03b5aab>] trace_raw_output_xfs_attr_class+0x8b/0xe0 [xfs]
    [<ffffffff81002122>] do_one_initcall+0xb2/0x1f0
    [<ffffffff811765aa>] do_init_module+0x5e/0x1e9
    [<ffffffff810fec09>] load_module+0x20a9/0x2690
    [<ffffffff810ff439>] SyS_finit_module+0xb9/0xf0
    [<ffffffff81757daf>] entry_SYSCALL_64_fastpath+0x12/0x76
    [<ffffffffffffffff>] 0xffffffffffffffff

This patch lets btrfs using fs_info stored in btrfs_root for
block group cache directly without allocating a new one.

Fixes: d0bd456074 ("Btrfs: add fragment=* debug mount option")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 13:28:24 +01:00
Zhao Lei
4da2e26a2a btrfs: Continue write in case of can_not_nocow
btrfs failed in xfstests btrfs/080 with -o nodatacow.

Can be reproduced by following script:
  DEV=/dev/vdg
  MNT=/mnt/tmp

  umount $DEV &>/dev/null
  mkfs.btrfs -f $DEV
  mount -o nodatacow $DEV $MNT

  dd if=/dev/zero of=$MNT/test bs=1 count=2048 &
  btrfs subvolume snapshot -r $MNT $MNT/test_snap &
  wait
  --
  We can see dd failed on NO_SPACE.

Reason:
  __btrfs_buffered_write should run cow write when no_cow impossible,
  and current code is designed with above logic.
  But check_can_nocow() have 2 type of return value(0 and <0) on
  can_not_no_cow, and current code only continue write on first case,
  the second case happened in doing subvolume.

Fix:
  Continue write when check_can_nocow() return 0 and <0.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
2016-02-18 13:18:06 +01:00
Kinglong Mee
5598e9005a btrfs: drop null testing before destroy functions
Cleanup.

kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
Sudip Mukherjee
89771cc98c btrfs: fix build warning
We were getting build warning about:
fs/btrfs/extent-tree.c:7021:34: warning: ‘used_bg’ may be used
	uninitialized in this function

It is not a valid warning as used_bg is never used uninitilized since
locked is initially false so we can never be in the section where
'used_bg' is used. But gcc is not able to understand that and we can
initialize it while declaring to silence the warning.

Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
David Sterba
47dc196ae7 btrfs: use proper type for failrec in extent_state
We use the private member of extent_state to store the failrec and play
pointless pointer games.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
Deepa Dinamani
04b285f35e btrfs: Replace CURRENT_TIME by current_fs_time()
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:46:03 +01:00
Dave Jones
8f682f6955 btrfs: remove open-coded swap() in backref.c:__merge_refs
The kernel provides a swap() that does the same thing as this code.

Signed-off-by: Dave Jones <dsj@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:45:55 +01:00
Byongho Lee
ac1407ba24 btrfs: remove redundant error check
While running btrfs_mksubvol(), d_really_is_positive() is called twice.
First in btrfs_mksubvol() and second inside btrfs_may_create().  So I
remove the first one.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:35:27 +01:00
Byongho Lee
0138b6fe8f btrfs: simplify expression in btrfs_calc_trans_metadata_size()
Simplify expression in btrfs_calc_trans_metadata_size().

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:33:17 +01:00
Josef Bacik
baee879064 Btrfs: check reserved when deciding to background flush
We will sometimes start background flushing the various enospc related things
(delayed nodes, delalloc, etc) if we are getting close to reserving all of our
available space.  We don't want to do this however when we are actually using
this space as it causes unneeded thrashing.  We currently try to do this by
checking bytes_used >= thresh, but bytes_used is only part of the equation, we
need to use bytes_reserved as well as this represents space that is very likely
to become bytes_used in the future.

My tracing tool will keep count of the number of times we kick off the async
flusher, the following are counts for the entire run of generic/027

		No Patch	Patch
avg: 		5385		5009
median:		5500		4916

We skewed lower than the average with my patch and higher than the average with
the patch, overall it cuts the flushing from anywhere from 5-10%, which in the
case of actual ENOSPC is quite helpful.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:29:43 +01:00
Josef Bacik
88d3a5aaf6 Btrfs: add transaction space reservation tracepoints
There are a few places where we add to trans->bytes_reserved but don't have the
corresponding trace point.  With these added my tool no longer sees transaction
leaks.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:22:41 +01:00
Josef Bacik
dc95f7bfc5 Btrfs: fix truncate_space_check
truncate_space_check is using btrfs_csum_bytes_to_leaves() but forgetting to
multiply by nodesize so we get an actual byte count.  We need a tracepoint here
so that we have the matching reserve for the release that will come later.  Also
add a comment to make clear what the intent of truncate_space_check is.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:22:24 +01:00
Josef Bacik
fb4b10e5d5 Btrfs: change how we update the global block rsv
I'm writing a tool to visualize the enospc system in order to help debug enospc
bugs and I found weird data and ran it down to when we update the global block
rsv.  We add all of the remaining free space to the block rsv, do a trace event,
then remove the extra and do another trace event.  This makes my visualization
look silly and is unintuitive code as well.  Fix this stuff to only add the
amount we are missing, or free the amount we are missing.  This is less clean to
read but more explicit in what it is doing, as well as only emitting events for
values that make sense.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 11:21:48 +01:00
Zhao Lei
7aff8cf4a6 btrfs: reada: ignore creating reada_extent for a non-existent device
For a non-existent device, old code bypasses adding it in dev's reada
queue.

And to solve problem of unfinished waitting in raid5/6,
commit 5fbc7c59fd ("Btrfs: fix unfinished readahead thread for
raid5/6 degraded mounting")
adding an exception for the first stripe, in short, the first
stripe will always be processed whether the device exists or not.

Actually we have a better way for the above request: just bypass
creation of the reada_extent for non-existent device, it will make
code simple and effective.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:27:23 +01:00
Zhao Lei
4fe7a0e138 btrfs: reada: avoid undone reada extents in btrfs_reada_wait
Reada background works is not designed to finish all jobs
completely, it will break in following case:
1: When a device reaches workload limit (MAX_IN_FLIGHT)
2: Total reads reach max limit (10000)
3: All devices don't have queued more jobs, often happened in DUP case

And if all background works exit with remaining jobs,
btrfs_reada_wait() will wait indefinetelly.

Above problem is rarely happened in old code, because:
1: Every work queues 2x new works
   So many works reduced chances of undone jobs.
2: One work will continue 10000 times loop in case of no-jobs
   It reduced no-thread window time.

But after we fixed above case, the "undone reada extents" frequently
happened.

Fix:
 Check to ensure we have at least one thread if there are undone jobs
 in btrfs_reada_wait().

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18 10:27:23 +01:00