Commit Graph

48450 Commits

Author SHA1 Message Date
Linus Torvalds
52281b38bc Merge tag 'pstore-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:
 "Improvements and fixes to pstore subsystem:

   - add additional checks for bad platform data

   - remove bounce buffer in console writer

   - protect read/unlink race with a mutex

   - correctly give up during dump locking failures

   - increase ftrace bandwidth by splitting ftrace buffers per CPU"

* tag 'pstore-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  ramoops: add pdata NULL check to ramoops_probe
  pstore: Convert console write to use ->write_buf
  pstore: Protect unlink with read_mutex
  pstore: Use global ftrace filters for function trace filtering
  ftrace: Provide API to use global filtering for ftrace ops
  pstore: Clarify context field przs as dprzs
  pstore: improve error report for failed setup
  pstore: Merge per-CPU ftrace records into one
  pstore: Add ftrace timestamp counter
  ramoops: Split ftrace buffer space into per-CPU zones
  pstore: Make ramoops_init_przs generic for other prz arrays
  pstore: Allow prz to control need for locking
  pstore: Warn on PSTORE_TYPE_PMSG using deprecated function
  pstore: Make spinlock per zone instead of global
  pstore: Actually give up during locking failure
2016-12-13 09:16:11 -08:00
Chris Mason
5f52a2c512 Merge branch 'for-chris-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.10
Patches queued up by Filipe:

The most important change is still the fix for the extent tree
corruption that happens due to balance when qgroups are enabled (a
regression introduced in 4.7 by a fix for a regression from the last
qgroups rework). This has been hitting SLE and openSUSE users and QA
very badly, where transactions keep getting aborted when running
delayed references leaving the root filesystem in RO mode and nearly
unusable.  There are fixes here that allow us to run xfstests again
with the integrity checker enabled, which has been impossible since 4.8
(apparently I'm the only one running xfstests with the integrity
checker enabled, which is useful to validate dirtied leafs, like
checking if there are keys out of order, etc).  The rest are just some
trivial fixes, most of them tagged for stable, and two cleanups.

Signed-off-by: Chris Mason <clm@fb.com>
2016-12-13 09:14:42 -08:00
Richard Weinberger
3858866866 ubifs: Use FS_CFLG_OWN_PAGES
Commit bd7b829038 ("fscrypt: Cleanup page locking requirements for
fscrypt_{decrypt,encrypt}_page()") renamed the flag.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-13 16:18:16 +01:00
Jan Kara
5716863e0f fsnotify: Fix possible use-after-free in inode iteration on umount
fsnotify_unmount_inodes() plays complex tricks to pin next inode in the
sb->s_inodes list when iterating over all inodes. Furthermore the code has a
bug that if the current inode is the last on i_sb_list that does not have e.g.
I_FREEING set, then we leave next_i pointing to inode which may get removed
from the i_sb_list once we drop s_inode_list_lock thus resulting in
use-after-free issues (usually manifesting as infinite looping in
fsnotify_unmount_inodes()).

Fix the problem by keeping current inode pinned somewhat longer. Then we can
make the code much simpler and standard.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2016-12-13 12:57:52 +01:00
Linus Torvalds
e7aa8c2eb1 Merge tag 'docs-4.10' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
 "These are the documentation changes for 4.10.

  It's another busy cycle for the docs tree, as the sphinx conversion
  continues. Highlights include:

   - Further work on PDF output, which remains a bit of a pain but
     should be more solid now.

   - Five more DocBook template files converted to Sphinx. Only 27 to
     go... Lots of plain-text files have also been converted and
     integrated.

   - Images in binary formats have been replaced with more
     source-friendly versions.

   - Various bits of organizational work, including the renaming of
     various files discussed at the kernel summit.

   - New documentation for the device_link mechanism.

  ... and, of course, lots of typo fixes and small updates"

* tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
  dma-buf: Extract dma-buf.rst
  Update Documentation/00-INDEX
  docs: 00-INDEX: document directories/files with no docs
  docs: 00-INDEX: remove non-existing entries
  docs: 00-INDEX: add missing entries for documentation files/dirs
  docs: 00-INDEX: consolidate process/ and admin-guide/ description
  scripts: add a script to check if Documentation/00-INDEX is sane
  Docs: change sh -> awk in REPORTING-BUGS
  Documentation/core-api/device_link: Add initial documentation
  core-api: remove an unexpected unident
  ppc/idle: Add documentation for powersave=off
  Doc: Correct typo, "Introdution" => "Introduction"
  Documentation/atomic_ops.txt: convert to ReST markup
  Documentation/local_ops.txt: convert to ReST markup
  Documentation/assoc_array.txt: convert to ReST markup
  docs-rst: parse-headers.pl: cleanup the documentation
  docs-rst: fix media cleandocs target
  docs-rst: media/Makefile: reorganize the rules
  docs-rst: media: build SVG from graphviz files
  docs-rst: replace bayer.png by a SVG image
  ...
2016-12-12 21:58:13 -08:00
Linus Torvalds
e34bac726d Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - various misc bits

 - most of MM (quite a lot of MM material is awaiting the merge of
   linux-next dependencies)

 - kasan

 - printk updates

 - procfs updates

 - MAINTAINERS

 - /lib updates

 - checkpatch updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits)
  init: reduce rootwait polling interval time to 5ms
  binfmt_elf: use vmalloc() for allocation of vma_filesz
  checkpatch: don't emit unified-diff error for rename-only patches
  checkpatch: don't check c99 types like uint8_t under tools
  checkpatch: avoid multiple line dereferences
  checkpatch: don't check .pl files, improve absolute path commit log test
  scripts/checkpatch.pl: fix spelling
  checkpatch: don't try to get maintained status when --no-tree is given
  lib/ida: document locking requirements a bit better
  lib/rbtree.c: fix typo in comment of ____rb_erase_color
  lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
  MAINTAINERS: add drm and drm/i915 irc channels
  MAINTAINERS: add "C:" for URI for chat where developers hang out
  MAINTAINERS: add drm and drm/i915 bug filing info
  MAINTAINERS: add "B:" for URI where to file bugs
  get_maintainer: look for arbitrary letter prefixes in sections
  printk: add Kconfig option to set default console loglevel
  printk/sound: handle more message headers
  printk/btrfs: handle more message headers
  printk/kdb: handle more message headers
  ...
2016-12-12 20:50:02 -08:00
Linus Torvalds
9465d9cc31 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The time/timekeeping/timer folks deliver with this update:

   - Fix a reintroduced signed/unsigned issue and cleanup the whole
     signed/unsigned mess in the timekeeping core so this wont happen
     accidentaly again.

   - Add a new trace clock based on boot time

   - Prevent injection of random sleep times when PM tracing abuses the
     RTC for storage

   - Make posix timers configurable for real tiny systems

   - Add tracepoints for the alarm timer subsystem so timer based
     suspend wakeups can be instrumented

   - The usual pile of fixes and updates to core and drivers"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  timekeeping: Use mul_u64_u32_shr() instead of open coding it
  timekeeping: Get rid of pointless typecasts
  timekeeping: Make the conversion call chain consistently unsigned
  timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
  alarmtimer: Add tracepoints for alarm timers
  trace: Update documentation for mono, mono_raw and boot clock
  trace: Add an option for boot clock as trace clock
  timekeeping: Add a fast and NMI safe boot clock
  timekeeping/clocksource_cyc2ns: Document intended range limitation
  timekeeping: Ignore the bogus sleep time if pm_trace is enabled
  selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
  clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
  clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
  arm64: dts: rockchip: Arch counter doesn't tick in system suspend
  clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
  posix-timers: Make them configurable
  posix_cpu_timers: Move the add_device_randomness() call to a proper place
  timer: Move sys_alarm from timer.c to itimer.c
  ptp_clock: Allow for it to be optional
  Kconfig: Regenerate *.c_shipped files after previous changes
  ...
2016-12-12 19:56:15 -08:00
Linus Torvalds
e71c3978d6 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug updates from Thomas Gleixner:
 "This is the final round of converting the notifier mess to the state
  machine. The removal of the notifiers and the related infrastructure
  will happen around rc1, as there are conversions outstanding in other
  trees.

  The whole exercise removed about 2000 lines of code in total and in
  course of the conversion several dozen bugs got fixed. The new
  mechanism allows to test almost every hotplug step standalone, so
  usage sites can exercise all transitions extensively.

  There is more room for improvement, like integrating all the
  pointlessly different architecture mechanisms of synchronizing,
  setting cpus online etc into the core code"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  tracing/rb: Init the CPU mask on allocation
  soc/fsl/qbman: Convert to hotplug state machine
  soc/fsl/qbman: Convert to hotplug state machine
  zram: Convert to hotplug state machine
  KVM/PPC/Book3S HV: Convert to hotplug state machine
  arm64/cpuinfo: Convert to hotplug state machine
  arm64/cpuinfo: Make hotplug notifier symmetric
  mm/compaction: Convert to hotplug state machine
  iommu/vt-d: Convert to hotplug state machine
  mm/zswap: Convert pool to hotplug state machine
  mm/zswap: Convert dst-mem to hotplug state machine
  mm/zsmalloc: Convert to hotplug state machine
  mm/vmstat: Convert to hotplug state machine
  mm/vmstat: Avoid on each online CPU loops
  mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
  tracing/rb: Convert to hotplug state machine
  oprofile/nmi timer: Convert to hotplug state machine
  net/iucv: Use explicit clean up labels in iucv_init()
  x86/pci/amd-bus: Convert to hotplug state machine
  x86/oprofile/nmi: Convert to hotplug state machine
  ...
2016-12-12 19:25:04 -08:00
Jason Baron
30f74aa085 binfmt_elf: use vmalloc() for allocation of vma_filesz
We have observed page allocations failures of order 4 during core dump
while trying to allocate vma_filesz.  This results in a useless core
file of size 0.  To improve reliability use vmalloc().

Note that the vmalloc() allocation is bounded by sysctl_max_map_count,
which is 65,530 by default.  So with a 4k page size, and 8 bytes per
seg, this is a max of 128 pages or an order 7 allocation.  Other parts
of the core dump path, such as fill_files_note() are already using
vmalloc() for presumably similar reasons.

Link: http://lkml.kernel.org/r/1479745791-17611-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:10 -08:00
Petr Mladek
262c5e86fe printk/btrfs: handle more message headers
Commit 4bcc595ccd ("printk: reinstate KERN_CONT for printing
continuation lines") allows to define more message headers for a single
message.  The motivation is that continuous lines might get mixed.
Therefore it make sense to define the right log level for every piece of
a cont line.

The current btrfs_printk() macros do not support continuous lines at the
moment.  But better be prepared for a custom messages and avoid
potential "lvl" buffer overflow.

This patch iterates over the entire message header.  It is interested
only into the message level like the original code.

This patch also introduces PRINTK_MAX_SINGLE_HEADER_LEN.  Three bytes
are enough for the message level header at the moment.  But it used to
be three, see the commit 04d2c8c83d ("printk: convert the format for
KERN_<LEVEL> to a 2 byte pattern").

Also I fixed the default ratelimit level.  It looked very strange when it
was different from the default log level.

[pmladek@suse.com: Fix a check of the valid message level]
  Link: http://lkml.kernel.org/r/20161111183236.GD2145@dhcp128.suse.cz
Link: http://lkml.kernel.org/r/1478695291-12169-4-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Joe Perches <joe@perches.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
1270dd8d99 fs/proc: calculate /proc/* and /proc/*/task/* nlink at init time
Runtime nlink calculation works but meh.  I don't know how to do it at
compile time, but I know how to do it at init time.

Shift "2+" part into init time as a bonus.

Link: http://lkml.kernel.org/r/20161122195549.GB29812@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
bac5f5d56b fs/proc/base.c: save decrement during lookup/readdir in /proc/$PID
Comparison for "<" works equally well as comparison for "<=" but one
SUB/LEA is saved (no, it is not optimised away, at least here).

Link: http://lkml.kernel.org/r/20161122195143.GA29812@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Rasmus Villemoes
209b14dc03 fs/proc/array.c: slightly improve render_sigset_t
format_decode and vsnprintf occasionally show up in perf top, so I went
looking for places that might not need the full printf power.  With the
help of kprobes, I gathered some statistics on which format strings we
mostly pass to vsnprintf.  On a trivial desktop workload, I hit "%x" 25%
of the time, so something apparently reads /proc/pid/status (which does
5*16 printf("%x") calls) a lot.

With this patch, reading /proc/pid/status is 30% faster according to
this microbenchmark:

	char buf[4096];
	int i, fd;
	for (i = 0; i < 10000; ++i) {
		fd = open("/proc/self/status", O_RDONLY);
		read(fd, buf, sizeof(buf));
		close(fd);
	}

Link: http://lkml.kernel.org/r/1474410485-1305-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
492b2da605 proc: tweak comments about 2 stage open and everything
Some comments were obsoleted since commit 05c0ae21c0 ("try a saner
locking for pde_opener...").

Some new comments added.

Some confusing comments replaced with equally confusing ones.

Link: http://lkml.kernel.org/r/20161029160231.GD1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
39a10ac23c proc: kmalloc struct pde_opener
kzalloc is too much, half of the fields will be reinitialized anyway.

If proc file doesn't have ->release hook (some still do not), clearing
is unnecessary because it will be freed immediately.

Link: http://lkml.kernel.org/r/20161029155747.GC1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
f5887c71cf proc: fix type of struct pde_opener::closing field
struct pde_opener::closing is boolean.

Link: http://lkml.kernel.org/r/20161029155439.GB1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
06a0c4175d proc: just list_del() struct pde_opener
list_del_init() is too much, structure will be freed in three lines
anyway.

Link: http://lkml.kernel.org/r/20161029155313.GA1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
9a87fe0d7c proc: make struct struct map_files_info::len unsigned int
Linux doesn't support 4GB+ filenames in /proc, so unsigned long is too
much.

MOV r64, r/m64 is larger than MOV r32, r/m32.

Link: http://lkml.kernel.org/r/20161029161123.GG1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
623f594e7d proc: make struct pid_entry::len unsigned
"unsigned int" is better on x86_64 because it most of the time it
autoexpands to 64-bit value while "int" requires MOVSX instruction.

Link: http://lkml.kernel.org/r/20161029160810.GF1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Kees Cook
af884cd4a5 proc: report no_new_privs state
Similar to being able to examine if a process has been correctly
confined with seccomp, the state of no_new_privs is equally interesting,
so this adds it to /proc/$pid/status.

Link: http://lkml.kernel.org/r/20161103214041.GA58566@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jann Horn <jann@thejh.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rodrigo Freire <rfreire@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Robert Ho <robert.hu@intel.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Richard W.M. Jones" <rjones@redhat.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Hugh Dickins
a66c0410b9 mm: add cond_resched() in gather_pte_stats()
The other pagetable walks in task_mmu.c have a cond_resched() after
walking their ptes: add a cond_resched() in gather_pte_stats() too, for
reading /proc/<id>/numa_maps.  Only pagemap_pmd_range() has a
cond_resched() in its (unusually expensive) pmd_trans_huge case: more
should probably be added, but leave them unchanged for now.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052157400.13021@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Johannes Weiner
4d693d0860 lib: radix-tree: update callback for changing leaf nodes
Support handing __radix_tree_replace() a callback that gets invoked for
all leaf nodes that change or get freed as a result of the slot
replacement, to assist users tracking nodes with node->private_list.

This prepares for putting page cache shadow entries into the radix tree
root again and drastically simplifying the shadow tracking.

Link: http://lkml.kernel.org/r/20161117193134.GD23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:08 -08:00
Johannes Weiner
6d75f366b9 lib: radix-tree: check accounting of existing slot replacement users
The bug in khugepaged fixed earlier in this series shows that radix tree
slot replacement is fragile; and it will become more so when not only
NULL<->!NULL transitions need to be caught but transitions from and to
exceptional entries as well.  We need checks.

Re-implement radix_tree_replace_slot() on top of the sanity-checked
__radix_tree_replace().  This requires existing callers to also pass the
radix tree root, but it'll warn us when somebody replaces slots with
contents that need proper accounting (transitions between NULL entries,
real entries, exceptional entries) and where a replacement through the
slot pointer would corrupt the radix tree node counts.

Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:08 -08:00
Johannes Weiner
f7942430e4 lib: radix-tree: native accounting of exceptional entries
The way the page cache is sneaking shadow entries of evicted pages into
the radix tree past the node entry accounting and tracking them manually
in the upper bits of node->count is fraught with problems.

These shadow entries are marked in the tree as exceptional entries,
which are a native concept to the radix tree.  Maintain an explicit
counter of exceptional entries in the radix tree node.  Subsequent
patches will switch shadow entry tracking over to that counter.

DAX and shmem are the other users of exceptional entries.  Since slot
replacements that change the entry type from regular to exceptional must
now be accounted, introduce a __radix_tree_replace() function that does
replacement and accounting, and switch DAX and shmem over.

The increase in radix tree node size is temporary.  A followup patch
switches the shadow tracking to this new scheme and we'll no longer need
the upper bits in node->count and shrink that back to one byte.

Link: http://lkml.kernel.org/r/20161117192945.GA23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:08 -08:00
Tahsin Erdogan
bace924818 fs/fs-writeback.c: remove redundant if check
b_more_io non-empty check is already preceded by an opposite check.

Link: http://lkml.kernel.org/r/1478591249-30641-1-git-send-email-tahsin@google.com
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:08 -08:00
Deepa Dinamani
c62c38f6b9 ocfs2: replace CURRENT_TIME macro
CURRENT_TIME is not y2038 safe.

Use y2038 safe ktime_get_real_seconds() here for timestamps.  struct
heartbeat_block's hb_seq and deletetion time are already 64 bits wide
and accommodate times beyond y2038.

Also use y2038 safe ktime_get_real_ts64() for on disk inode timestamps.
These are also wide enough to accommodate time64_t.

Link: http://lkml.kernel.org/r/1475365298-29236-1-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Deepa Dinamani
395627b071 ocfs2: use time64_t to represent orphan scan times
struct timespec is not y2038 safe.  Use time64_t which is y2038 safe to
represent orphan scan times.  time64_t is sufficient here as only the
seconds delta times are relevant.

Also use appropriate time functions that return time in time64_t format.
Time functions now return monotonic time instead of real time as only
delta scan times are relevant and these values are not persistent across
reboots.

The format string for the debug print is still using long as this is
only the time elapsed since the last scan and long is sufficient to
represent this value.

Link: http://lkml.kernel.org/r/1475365138-20567-1-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Ashish Samant
4131d53810 ocfs2: fix double put of recount tree in ocfs2_lock_refcount_tree()
In ocfs2_lock_refcount_tree, if ocfs2_read_refcount_block() returns an
error, we do ocfs2_refcount_tree_put twice (once in
ocfs2_unlock_refcount_tree and once outside it), thereby reducing the
refcount of the refcount tree twice, but we dont delete the tree in this
case.  This will make refcnt of the tree = 0 and the
ocfs2_refcount_tree_put will eventually call ocfs2_mark_lockres_freeing,
setting OCFS2_LOCK_FREEING for the refcount_tree->rf_lockres.

The error returned by ocfs2_read_refcount_block is propagated all the
way back and for next iteration of write, ocfs2_lock_refcount_tree gets
the same tree back from ocfs2_get_refcount_tree because we havent
deleted the tree.  Now we have the same tree, but OCFS2_LOCK_FREEING is
set for rf_lockres and eventually, when _ocfs2_lock_refcount_tree is
called in this iteration, BUG_ON( __ocfs2_cluster_lock:1395 ERROR:
Cluster lock called on freeing lockres T00000000000000000386019775b08d!
flags 0x81) is triggerred.

Call stack:

  (loop16,11155,0):ocfs2_lock_refcount_tree:482 ERROR: status = -5
  (loop16,11155,0):ocfs2_refcount_cow_hunk:3497 ERROR: status = -5
  (loop16,11155,0):ocfs2_refcount_cow:3560 ERROR: status = -5
  (loop16,11155,0):ocfs2_prepare_inode_for_refcount:2111 ERROR: status = -5
  (loop16,11155,0):ocfs2_prepare_inode_for_write:2190 ERROR: status = -5
  (loop16,11155,0):ocfs2_file_write_iter:2331 ERROR: status = -5
  (loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: bug expression:
  lockres->l_flags & OCFS2_LOCK_FREEING

  (loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: Cluster lock called on
  freeing lockres T00000000000000000386019775b08d! flags 0x81

  kernel BUG at fs/ocfs2/dlmglue.c:1395!

  invalid opcode: 0000 [#1] SMP  CPU 0
  Modules linked in: tun ocfs2 jbd2 xen_blkback xen_netback xen_gntdev .. sd_mod crc_t10dif ext3 jbd mbcache
  RIP: __ocfs2_cluster_lock+0x31c/0x740 [ocfs2]
  RSP: e02b:ffff88017c0138a0  EFLAGS: 00010086
  Process loop16 (pid: 11155, threadinfo ffff88017c010000, task ffff8801b5374300)
  Call Trace:
     ocfs2_refcount_lock+0xae/0x130 [ocfs2]
     __ocfs2_lock_refcount_tree+0x29/0xe0 [ocfs2]
     ocfs2_lock_refcount_tree+0xdd/0x320 [ocfs2]
     ocfs2_refcount_cow_hunk+0x1cb/0x440 [ocfs2]
     ocfs2_refcount_cow+0xa9/0x1d0 [ocfs2]
     ocfs2_prepare_inode_for_refcount+0x115/0x200 [ocfs2]
     ocfs2_prepare_inode_for_write+0x33b/0x470 [ocfs2]
     ocfs2_file_write_iter+0x220/0x8c0 [ocfs2]
     aio_write_iter+0x2e/0x30

Fix this by avoiding the second call to ocfs2_refcount_tree_put()

Link: http://lkml.kernel.org/r/1473984404-32011-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Eric Ren <zren@suse.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
piaojun
07f38d971c ocfs2: clean up unused 'page' parameter in ocfs2_write_end_nolock()
'page' parameter in ocfs2_write_end_nolock() is never used.

Link: http://lkml.kernel.org/r/582FD91A.5000902@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
piaojun
28bb5ef485 ocfs2/dlm: clean up deadcode in dlm_master_request_handler()
When 'dispatch_assert' is set, 'response' must be DLM_MASTER_RESP_YES,
and 'res' won't be null, so execution can't reach these two branch.

Link: http://lkml.kernel.org/r/58174C91.3040004@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Guozhonghua
aa7b58597f ocfs2: delete redundant code and set the node bit into maybe_map directly
The variable `set_maybe' is redundant when the mle has been found in the
map.  So it is ok to set the node_idx into mle's maybe_map directly.

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D490DD@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
piaojun
46832b2de5 ocfs2/dlm: clean up useless BUG_ON default case in dlm_finalize_reco_handler()
The value of 'stage' must be between 1 and 2, so the switch can't reach
the default case.

Link: http://lkml.kernel.org/r/57FB5EB2.7050002@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Theodore Ts'o
a551d7c8de Merge branch 'fscrypt' into dev 2016-12-12 21:50:28 -05:00
Jan Kara
0cb80b4847 dax: Fix sleep in atomic contex in grab_mapping_entry()
Commit 642261ac99: "dax: add struct iomap based DAX PMD support" has
introduced unmapping of page tables if huge page needs to be split in
grab_mapping_entry(). However the unmapping happens after
radix_tree_preload() call which disables preemption and thus
unmap_mapping_range() tries to acquire i_mmap_lock in atomic context
which is a bug. Fix the problem by moving unmapping before
radix_tree_preload() call.

Fixes: 642261ac99
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-12 21:34:12 -05:00
Yan, Zheng
dc24de82d6 ceph: properly set issue_seq for cap release
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
1e4ef0c633 ceph: add flags parameter to send_cap_msg
Add a flags parameter to send_cap_msg, so we can request expedited
service from the MDS when we know we'll be waiting on the result.

Set that flag in the case of try_flush_caps. The callers of that
function generally wait synchronously on the result, so it's beneficial
to ask the server to expedite it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
43b2967330 ceph: update cap message struct version to 10
The userland ceph has MClientCaps at struct version 10. This brings the
kernel up the same version.

For now, all of the the new stuff is set to default values including
the flags field, which will be conditionally set in a later patch.

Note that we don't need to set the change_attr and btime to anything
since we aren't currently setting the feature flag. The MDS should
ignore those values.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
0ff8bfb394 ceph: define new argument structure for send_cap_msg
When we get to this many arguments, it's hard to work with positional
parameters. send_cap_msg is already at 25 arguments, with more needed.

Define a new args structure and pass a pointer to it to send_cap_msg.
Eventually it might make sense to embed one of these inside
ceph_cap_snap instead of tracking individual fields.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
9670079f5f ceph: move xattr initialzation before the encoding past the ceph_mds_caps
Just for clarity. This part is inside the header, so it makes sense to
group it with the rest of the stuff in the header.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
4945a08479 ceph: fix minor typo in unsafe_request_wait
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Yan, Zheng
5f743e4566 ceph: record truncate size/seq for snap data writeback
Dirty snapshot data needs to be flushed unconditionally. If they
were created before truncation, writeback should use old truncate
size/seq.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Yan, Zheng
e9e427f0a1 ceph: check availability of mds cluster on mount
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Yan, Zheng
7ce469a53e ceph: fix splice read for no Fc capability case
When iov_iter type is ITER_PIPE, copy_page_to_iter() increases
the page's reference and add the page to a pipe_buffer. It also
set the pipe_buffer's ops to page_cache_pipe_buf_ops. The comfirm
callback in page_cache_pipe_buf_ops expects the page is from page
cache and uptodate, otherwise it return error.

For ceph_sync_read() case, pages are not from page cache. So we
can't call copy_page_to_iter() when iov_iter type is ITER_PIPE.
The fix is using iov_iter_get_pages_alloc() to allocate pages
for the pipe. (the code is similar to default_file_splice_read)

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Yan, Zheng
2b1ac852eb ceph: try getting buffer capability for readahead/fadvise
For readahead/fadvise cases, caller of ceph_readpages does not
hold buffer capability. Pages can be added to page cache while
there is no buffer capability. This can cause data integrity
issue.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Nikolay Borisov
5c341ee328 ceph: fix scheduler warning due to nested blocking
try_get_cap_refs can be used as a condition in a wait_event* calls.
This is all fine until it has to call __ceph_do_pending_vmtruncate,
which in turn acquires the i_truncate_mutex. This leads to a situation
in which a task's state is !TASK_RUNNING and at the same time it's
trying to acquire a sleeping primitive. In essence a nested sleeping
primitives are being used. This causes the following warning:

WARNING: CPU: 22 PID: 11064 at kernel/sched/core.c:7631 __might_sleep+0x9f/0xb0()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109447d>] prepare_to_wait_event+0x5d/0x110
 ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6
CPU: 22 PID: 11064 Comm: fs_checker.pl Tainted: G           O    4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
 0000000000000000 ffff8838b416fa88 ffffffff812f4409 ffff8838b416fad0
 ffffffff81a034f2 ffff8838b416fac0 ffffffff81052b46 ffffffff81a0432c
 0000000000000061 0000000000000000 0000000000000000 ffff88167bda54a0
Call Trace:
 [<ffffffff812f4409>] dump_stack+0x67/0x9e
 [<ffffffff81052b46>] warn_slowpath_common+0x86/0xc0
 [<ffffffff81052bcc>] warn_slowpath_fmt+0x4c/0x50
 [<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
 [<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
 [<ffffffff8107767f>] __might_sleep+0x9f/0xb0
 [<ffffffff81612d30>] mutex_lock+0x20/0x40
 [<ffffffffa04eea14>] __ceph_do_pending_vmtruncate+0x44/0x1a0 [ceph]
 [<ffffffffa04fa692>] try_get_cap_refs+0xa2/0x320 [ceph]
 [<ffffffffa04fd6f5>] ceph_get_caps+0x255/0x2b0 [ceph]
 [<ffffffff81094370>] ? wait_woken+0xb0/0xb0
 [<ffffffffa04f2c11>] ceph_write_iter+0x2b1/0xde0 [ceph]
 [<ffffffff81613f22>] ? schedule_timeout+0x202/0x260
 [<ffffffff8117f01a>] ? kmem_cache_free+0x1ea/0x200
 [<ffffffff811b46ce>] ? iput+0x9e/0x230
 [<ffffffff81077632>] ? __might_sleep+0x52/0xb0
 [<ffffffff81156147>] ? __might_fault+0x37/0x40
 [<ffffffff8119e123>] ? cp_new_stat+0x153/0x170
 [<ffffffff81198cfa>] __vfs_write+0xaa/0xe0
 [<ffffffff81199369>] vfs_write+0xa9/0x190
 [<ffffffff811b6d01>] ? set_close_on_exec+0x31/0x70
 [<ffffffff8119a056>] SyS_write+0x46/0xa0

This happens since wait_event_interruptible can interfere with the
mutex locking code, since they both fiddle with the task state.

Fix the issue by using the newly-added nested blocking infrastructure
in 61ada528de ("sched/wait: Provide infrastructure to deal with
nested blocking")

Link: https://lwn.net/Articles/628628/
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Zhi Zhang
a380a031cb ceph: fix printing wrong return variable in ceph_direct_read_write()
Fix printing wrong return variable for invalidate_inode_pages2_range in
ceph_direct_read_write().

Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-12-12 23:54:27 +01:00
Ilya Dryomov
0dde584882 libceph: drop len argument of *verify_authorizer_reply()
The length of the reply is protocol-dependent - for cephx it's
ceph_x_authorize_reply.  Nothing sensible can be passed from the
messenger layer anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Richard Weinberger
fc4b891bbe ubifs: Raise write version to 5
Starting with version 5 the following properties change:
 - UBIFS_FLG_DOUBLE_HASH is mandatory
 - UBIFS_FLG_ENCRYPTION is optional but depdens on UBIFS_FLG_DOUBLE_HASH
 - Filesystems with unknown super block flags will be rejected, this
   allows us in future to add new features without raising the UBIFS
   write version.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
e021986ee4 ubifs: Implement UBIFS_FLG_ENCRYPTION
This feature flag indicates that the filesystem contains encrypted
files.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
d63d61c169 ubifs: Implement UBIFS_FLG_DOUBLE_HASH
This feature flag indicates that all directory entry nodes have a 32bit
cookie set and therefore UBIFS is allowed to perform lookups by hash.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00