Commit Graph

56429 Commits

Author SHA1 Message Date
Chao Yu
2a510c005c f2fs: introduce __wait_one_discard_bio
In order to avoid copied codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:36 -07:00
Qiuyang Sun
5a3a2d83cd f2fs: dax: fix races between page faults and truncating pages
Currently in F2FS, page faults and operations that truncate the pagecahe
or data blocks, are completely unsynchronized. This can result in page
fault faulting in a page into a range that we are changing after
truncating, and thus we can end up with a page mapped to disk blocks that
will be shortly freed. Filesystem corruption will shortly follow.

This patch fixes the problem by creating new rw semaphore i_mmap_sem in
f2fs_inode_info and grab it for functions removing blocks from extent tree
and for read over page faults. The mechanism is similar to that in ext4.

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:35 -07:00
Fan Li
72fdbe2efe f2fs: simplify the way of calulating next nat address
The index of segment which the next nat block is in has only one different
bit than the current one, so to get the next nat address, we can simply
alter that one bit.

Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:34 -07:00
Jin Qian
21d3f8e1c3 f2fs: sanity check size of nat and sit cache
Make sure number of entires doesn't exceed max journal size.

Cc: stable@vger.kernel.org
Signed-off-by: Jin Qian <jinqian@android.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:34 -07:00
Yunlei He
d4fdf8ba0e f2fs: fix a panic caused by NULL flush_cmd_control
Mount fs with option noflush_merge, boot failed for illegal address
fcc in function f2fs_issue_flush:

        if (!test_opt(sbi, FLUSH_MERGE)) {
                ret = submit_flush_wait(sbi);
                atomic_inc(&fcc->issued_flush);   ->  Here, fcc illegal
                return ret;
        }

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:33 -07:00
Zhang Shengju
68390dd9bd f2fs: remove the unnecessary cast for PTR_ERR
It's not necessary to specify 'int' casting for PTR_ERR.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:32 -07:00
Jaegeuk Kim
d8c4256c17 f2fs: remove false-positive bug_on
For example,

f2fs_create
 - new_node_page is failed
 - handle_failed_inode
  - skip to add it into orphan list, since ni.blk_addr == NULL_ADDR
   : set_inode_flag(inode, FI_FREE_NID)

f2fs_evict_inode
 - EIO due to fault injection
 - f2fs_bug_on() is triggered

So, we don't need to call f2fs_bug_on in this case.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:11:31 -07:00
Damien Le Moal
acfd2810c7 f2fs: Do not issue small discards in LFS mode
clear_prefree_segments() issues small discards after discarding full
segments. These small discards may not be section aligned, so not zone
aligned on a zoned block device, causing __f2fs_iissue_discard_zone() to fail.
Fix this by not issuing small discards for a volume mounted with the BLKZONED
feature enabled.

Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-04 02:10:42 -07:00
Linus Torvalds
650fc870a2 Merge tag 'docs-4.13' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
 "There has been a fair amount of activity in the docs tree this time
  around. Highlights include:

   - Conversion of a bunch of security documentation into RST

   - The conversion of the remaining DocBook templates by The Amazing
     Mauro Machine. We can now drop the entire DocBook build chain.

   - The usual collection of fixes and minor updates"

* tag 'docs-4.13' of git://git.lwn.net/linux: (90 commits)
  scripts/kernel-doc: handle DECLARE_HASHTABLE
  Documentation: atomic_ops.txt is core-api/atomic_ops.rst
  Docs: clean up some DocBook loose ends
  Make the main documentation title less Geocities
  Docs: Use kernel-figure in vidioc-g-selection.rst
  Docs: fix table problems in ras.rst
  Docs: Fix breakage with Sphinx 1.5 and upper
  Docs: Include the Latex "ifthen" package
  doc/kokr/howto: Only send regression fixes after -rc1
  docs-rst: fix broken links to dynamic-debug-howto in kernel-parameters
  doc: Document suitability of IBM Verse for kernel development
  Doc: fix a markup error in coding-style.rst
  docs: driver-api: i2c: remove some outdated information
  Documentation: DMA API: fix a typo in a function name
  Docs: Insert missing space to separate link from text
  doc/ko_KR/memory-barriers: Update control-dependencies example
  Documentation, kbuild: fix typo "minimun" -> "minimum"
  docs: Fix some formatting issues in request-key.rst
  doc: ReSTify keys-trusted-encrypted.txt
  doc: ReSTify keys-request-key.txt
  ...
2017-07-03 21:13:25 -07:00
Tahsin Erdogan
407cd7fb83 ext4: change fast symlink test to not rely on i_blocks
ext4_inode_info->i_data is the storage area for 4 types of data:

  a) Extents data
  b) Inline data
  c) Block map
  d) Fast symlink data (symlink length < 60)

Extents data case is positively identified by EXT4_INODE_EXTENTS flag.
Inline data case is also obvious because of EXT4_INODE_INLINE_DATA
flag.

Distinguishing c) and d) however requires additional logic. This
currently relies on i_blocks count. After subtracting external xattr
block from i_blocks, if it is greater than 0 then we know that some
data blocks exist, so there must be a block map.

This logic got broken after ea_inode feature was added. That feature
charges the data blocks of external xattr inodes to the referencing
inode and so adds them to the i_blocks. To fix this, we could subtract
ea_inode blocks by iterating through all xattr entries and then check
whether remaining i_blocks count is zero. Besides being complicated,
this won't change the fact that the current way of distinguishing
between c) and d) is fragile.

The alternative solution is to test whether i_size is less than 60 to
determine fast symlink case. ext4_symlink() uses the same test to decide
whether to store the symlink in i_data. There is one caveat to address
before this can work though.

If an inode's i_nlink is zero during eviction, its i_size is set to
zero and its data is truncated. If system crashes before inode is removed
from the orphan list, next boot orphan cleanup may find the inode with
zero i_size. So, a symlink that had its data stored in a block may now
appear to be a fast symlink. The solution used in this patch is to treat
i_size = 0 as a non-fast symlink case. A zero sized symlink is not legal
so the only time this can happen is the mentioned scenario. This is also
logically correct because a i_size = 0 symlink has no data stored in
i_data.

Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2017-07-04 00:11:21 -04:00
Linus Torvalds
9a715cd543 Merge tag 'tty-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial updates from Greg KH:
 "Here is the large tty/serial patchset for 4.13-rc1.

  A lot of tty and serial driver updates are in here, along with some
  fixups for some __get/put_user usages that were reported. Nothing
  huge, just lots of development by a number of different developers,
  full details in the shortlog.

  All of these have been in linux-next for a while"

* tag 'tty-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (71 commits)
  tty: serial: lpuart: add a more accurate baud rate calculation method
  tty: serial: lpuart: add earlycon support for imx7ulp
  tty: serial: lpuart: add imx7ulp support
  dt-bindings: serial: fsl-lpuart: add i.MX7ULP support
  tty: serial: lpuart: add little endian 32 bit register support
  tty: serial: lpuart: refactor lpuart32_{read|write} prototype
  tty: serial: lpuart: introduce lpuart_soc_data to represent SoC property
  serial: imx-serial - move DMA buffer configuration to DT
  serial: imx: Enable RTSD only when needed
  serial: imx: Remove unused members from imx_port struct
  serial: 8250: 8250_omap: Fix race b/w dma completion and RX timeout
  serial: 8250: Fix THRE flag usage for CAP_MINI
  tty/serial: meson_uart: update to stable bindings
  dt-bindings: serial: Add bindings for the Amlogic Meson UARTs
  serial: Delete dead code for CIR serial ports
  serial: sirf: make of_device_ids const
  serial/mpsc: switch to dma_alloc_attrs
  tty: serial: Add Actions Semi Owl UART earlycon
  dt-bindings: serial: Document Actions Semi Owl UARTs
  tty/serial: atmel: make the driver DT only
  ...
2017-07-03 20:04:16 -07:00
Miklos Szeredi
7f53b7d047 Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid into overlayfs-next
UUID/GUID updates:

 - introduce the new uuid_t/guid_t types that are going to replace
   the somewhat confusing uuid_be/uuid_le types and make the terminology
   fit the various specs, as well as the userspace libuuid library.
   (me, based on a previous version from Amir)
 - consolidated generic uuid/guid helper functions lifted from XFS
   and libnvdimm (Amir and me)
 - conversions to the new types and helpers (Amir, Andy and me)
2017-07-04 04:05:05 +02:00
Dan Williams
9d92573fff Merge branch 'for-4.13/dax' into libnvdimm-for-next 2017-07-03 16:54:58 -07:00
Al Viro
468138d785 binfmt_flat: flat_{get,put}_addr_from_rp() should be able to fail
on MMU targets EFAULT is possible here.  Make both return 0 or error,
passing what used to be the return value of flat_get_addr_from_rp()
by reference.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-03 18:44:02 -04:00
Linus Torvalds
9bd42183b9 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
     debug checks earlier into the bootup. This turns silent and
     sporadically deadly bugs into nice, deterministic splats. Fix some
     of the splats that triggered. (Thomas Gleixner)

   - A round of restructuring and refactoring of the load-balancing and
     topology code (Peter Zijlstra)

   - Another round of consolidating ~20 of incremental scheduler code
     history: this time in terms of wait-queue nomenclature. (I didn't
     get much feedback on these renaming patches, and we can still
     easily change any names I might have misplaced, so if anyone hates
     a new name, please holler and I'll fix it.) (Ingo Molnar)

   - sched/numa improvements, fixes and updates (Rik van Riel)

   - Another round of x86/tsc scheduler clock code improvements, in hope
     of making it more robust (Peter Zijlstra)

   - Improve NOHZ behavior (Frederic Weisbecker)

   - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
     Bristot de Oliveira)

   - Simplify and optimize the topology setup code (Lauro Ramos
     Venancio)

   - Debloat and decouple scheduler code some more (Nicolas Pitre)

   - Simplify code by making better use of llist primitives (Byungchul
     Park)

   - ... plus other fixes and improvements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
  sched/cputime: Refactor the cputime_adjust() code
  sched/debug: Expose the number of RT/DL tasks that can migrate
  sched/numa: Hide numa_wake_affine() from UP build
  sched/fair: Remove effective_load()
  sched/numa: Implement NUMA node level wake_affine()
  sched/fair: Simplify wake_affine() for the single socket case
  sched/numa: Override part of migrate_degrades_locality() when idle balancing
  sched/rt: Move RT related code from sched/core.c to sched/rt.c
  sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
  sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
  sched/fair: Spare idle load balancing on nohz_full CPUs
  nohz: Move idle balancer registration to the idle path
  sched/loadavg: Generalize "_idle" naming to "_nohz"
  sched/core: Drop the unused try_get_task_struct() helper function
  sched/fair: WARN() and refuse to set buddy when !se->on_rq
  sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
  sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
  sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
  sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
  sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
  ...
2017-07-03 13:08:04 -07:00
Linus Torvalds
c6b1e36c8f Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-block
Pull core block/IO updates from Jens Axboe:
 "This is the main pull request for the block layer for 4.13. Not a huge
  round in terms of features, but there's a lot of churn related to some
  core cleanups.

  Note this depends on the UUID tree pull request, that Christoph
  already sent out.

  This pull request contains:

   - A series from Christoph, unifying the error/stats codes in the
     block layer. We now use blk_status_t everywhere, instead of using
     different schemes for different places.

   - Also from Christoph, some cleanups around request allocation and IO
     scheduler interactions in blk-mq.

   - And yet another series from Christoph, cleaning up how we handle
     and do bounce buffering in the block layer.

   - A blk-mq debugfs series from Bart, further improving on the support
     we have for exporting internal information to aid debugging IO
     hangs or stalls.

   - Also from Bart, a series that cleans up the request initialization
     differences across types of devices.

   - A series from Goldwyn Rodrigues, allowing the block layer to return
     failure if we will block and the user asked for non-blocking.

   - Patch from Hannes for supporting setting loop devices block size to
     that of the underlying device.

   - Two series of patches from Javier, fixing various issues with
     lightnvm, particular around pblk.

   - A series from me, adding support for write hints. This comes with
     NVMe support as well, so applications can help guide data placement
     on flash to improve performance, latencies, and write
     amplification.

   - A series from Ming, improving and hardening blk-mq support for
     stopping/starting and quiescing hardware queues.

   - Two pull requests for NVMe updates. Nothing major on the feature
     side, but lots of cleanups and bug fixes. From the usual crew.

   - A series from Neil Brown, greatly improving the bio rescue set
     support. Most notably, this kills the bio rescue work queues, if we
     don't really need them.

   - Lots of other little bug fixes that are all over the place"

* 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
  lightnvm: pblk: set line bitmap check under debug
  lightnvm: pblk: verify that cache read is still valid
  lightnvm: pblk: add initialization check
  lightnvm: pblk: remove target using async. I/Os
  lightnvm: pblk: use vmalloc for GC data buffer
  lightnvm: pblk: use right metadata buffer for recovery
  lightnvm: pblk: schedule if data is not ready
  lightnvm: pblk: remove unused return variable
  lightnvm: pblk: fix double-free on pblk init
  lightnvm: pblk: fix bad le64 assignations
  nvme: Makefile: remove dead build rule
  blk-mq: map all HWQ also in hyperthreaded system
  nvmet-rdma: register ib_client to not deadlock in device removal
  nvme_fc: fix error recovery on link down.
  nvmet_fc: fix crashes on bad opcodes
  nvme_fc: Fix crash when nvme controller connection fails.
  nvme_fc: replace ioabort msleep loop with completion
  nvme_fc: fix double calls to nvme_cleanup_cmd()
  nvme-fabrics: verify that a controller returns the correct NQN
  nvme: simplify nvme_dev_attrs_are_visible
  ...
2017-07-03 10:34:51 -07:00
Linus Torvalds
81e3e04489 Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid
Pull uuid subsystem from Christoph Hellwig:
 "This is the new uuid subsystem, in which Amir, Andy and I have started
  consolidating our uuid/guid helpers and improving the types used for
  them. Note that various other subsystems have pulled in this tree, so
  I'd like it to go in early.

  UUID/GUID summary:

   - introduce the new uuid_t/guid_t types that are going to replace the
     somewhat confusing uuid_be/uuid_le types and make the terminology
     fit the various specs, as well as the userspace libuuid library.
     (me, based on a previous version from Amir)

   - consolidated generic uuid/guid helper functions lifted from XFS and
     libnvdimm (Amir and me)

   - conversions to the new types and helpers (Amir, Andy and me)"

* tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
  ACPI: hns_dsaf_acpi_dsm_guid can be static
  mmc: sdhci-pci: make guid intel_dsm_guid static
  uuid: Take const on input of uuid_is_null() and guid_is_null()
  thermal: int340x_thermal: fix compile after the UUID API switch
  thermal: int340x_thermal: Switch to use new generic UUID API
  acpi: always include uuid.h
  ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
  ACPI / extlog: Switch to use new generic UUID API
  ACPI / bus: Switch to use new generic UUID API
  ACPI / APEI: Switch to use new generic UUID API
  acpi, nfit: Switch to use new generic UUID API
  MAINTAINERS: add uuid entry
  tmpfs: generate random sb->s_uuid
  scsi_debug: switch to uuid_t
  nvme: switch to uuid_t
  sysctl: switch to use uuid_t
  partitions/ldm: switch to use uuid_t
  overlayfs: use uuid_t instead of uuid_be
  fs: switch ->s_uuid to uuid_t
  ima/policy: switch to use uuid_t
  ...
2017-07-03 09:55:26 -07:00
Christoph Hellwig
9b2970aacf xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
Switch to the iomap_seek_hole and iomap_seek_data helpers for
implementing lseek SEEK_HOLE / SEEK_DATA, and remove all the
code that isn't needed any more.

Based on patches from Andreas Gruenbacher <agruenba@redhat.com>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-02 22:46:13 -07:00
Andreas Gruenbacher
0ed3b0d45f vfs: Add iomap_seek_hole and iomap_seek_data helpers
Filesystems can use this for implementing lseek SEEK_HOLE / SEEK_DATA
support via iomap.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
[hch: split functions, coding style cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-02 22:46:13 -07:00
Andreas Gruenbacher
334fd34d76 vfs: Add page_cache_seek_hole_data helper
Both ext4 and xfs implement seeking for the next hole or piece of data
in unwritten extents by scanning the page cache, and both versions share
the same bug when iterating the buffers of a page: the start offset into
the page isn't taken into account, so when a page fits more than two
filesystem blocks, things will go wrong.  For example, on a filesystem
with a block size of 1k, the following command will fail:

  xfs_io -f -c "falloc 0 4k" \
            -c "pwrite 1k 1k" \
            -c "pwrite 3k 1k" \
            -c "seek -a -r 0" foo

In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048,
SEEK_DATA) will return the correct result.

Introduce a generic vfs helper for seeking in the page cache that gets
this right.  The next commits will replace the filesystem specific
implementations.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
[hch: dropped the export]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-02 22:46:13 -07:00
Steve French
1955880b2c SMB3: Enable encryption for SMB3.1.1
We were missing a capability flag for SMB3.1.1

Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-07-02 16:49:06 -05:00
Christoph Hellwig
7175a11214 xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-01 21:09:33 -07:00
Christoph Hellwig
bda250dbaf xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
This goes straight to a single lookup in the extent list and avoids a
roundtrip through two layers that don't add any value for the simple
quoata file that just has data or holes and no page cache, delayed
allocation, unwritten extent or COW fork (which btw, doesn't seem to
be handled by the existing SEEK HOLE/DATA code).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-01 21:09:33 -07:00
Carlos Maiolino
d04c241c66 xfs: Check for m_errortag initialization in xfs_errortag_test
While adding error injection into IO completion, I notice the lack of
initialization check in xfs_errortag_test(), make the error injection
mechanism unable to be used there.

IO completion is executed a few times before the error injection
mechanism is initialized, so to be safer, make xfs_errortag_test() check
if the errortag is properly initialized.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-01 21:08:47 -07:00
Kees Cook
3859a271a0 randstruct: Mark various structs for randomization
This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
contain functions pointers, pointers to function pointer tables, lists,
workqueues, ref-counters, credentials, permissions, or are otherwise
sensitive. This initial list was extracted from Brad Spengler/PaX Team's
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.

Left out of this list is task_struct, which requires special handling
and will be covered in a subsequent patch.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-30 12:00:51 -07:00
Linus Torvalds
86c3e00afd Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "Fix two bugs in copy-up code. One introduced in 4.11 and one in
  4.12-rc"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: don't set origin on broken lower hardlink
  ovl: copy-up: don't unlock between lookup and link
2017-06-30 10:22:59 -07:00
David S. Miller
b079115937 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A set of overlapping changes in macvlan and the rocker
driver, nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 12:43:08 -04:00
Deepa Dinamani
bff412036f timerfd: Use get_itimerspec64() and put_itimerspec64()
Usage of these apis and their compat versions makes
the syscalls: timerfd_settime and timerfd_gettime and
their compat implementations simpler.

This patch also serves as a preparatory patch for changing
syscalls to use new time_t data types to support the
y2038 effort by isolating the processing of user pointers
through these apis.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 04:14:38 -04:00
Carlos Maiolino
a8e2b63677 Make statfs properly return read-only state after emergency remount
Emergency remount (sysrq-u) sets MS_RDONLY to the superblock but doesn't set
MNT_READONLY to the mount point.

Once calculate_f_flags() only check for the mount point read only state,
when setting kstatfs flags, after an emergency remount, statfs does not
report the filesystem as read-only, even though it is.

Enable flags_by_sb() to also check for superblock read only state, so the
kstatfs and consequently statfs can properly show the read-only state of
the filesystem.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 20:21:06 -04:00
Sebastian Andrzej Siewior
6916363f30 fs/dcache: init in_lookup_hashtable
in_lookup_hashtable was introduced in commit 94bdd655ca ("parallel
lookups machinery, part 3") and never initialized but since it is in
the data it is all zeros. But we need this for -RT.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 20:17:14 -04:00
Denys Vlasenko
4f2ed69414 minix: Deinline get_block, save 2691 bytes
This function compiles to 1402 bytes of machine code.

It has 2 callsites, and also a not-inlined copy gets created by compiler
anyway since its address gets passed as a parameter to block_truncate_page().

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 20:09:12 -04:00
Kees Cook
cc658db47d fs: Reorder inode_owner_or_capable() to avoid needless
Checking for capabilities should be the last operation when performing
access control tests so that PF_SUPERPRIV is set only when it was required
for success (implying that the capability was needed for the operation).

Reported-by: Solar Designer <solar@openwall.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 20:08:32 -04:00
Luis R. Rodriguez
41124db869 fs: warn in case userspace lied about modprobe return
kmod <= v19 was broken -- it could return 0 to modprobe calls,
incorrectly assuming that a kernel module was built-in, whereas in
reality the module was just forming in the kernel. The reason for this
is an incorrect userspace heuristics. A userspace kmod fix is available
for it [0], however should userspace break again we could go on with
an failed get_fs_type() which is hard to debug as the request_module()
is detected as returning 0. The first suspect would be that there is
something worth with the kernel's module loader and obviously in this
case that is not the issue.

Since these issues are painful to debug complain when we know userspace
has outright lied to us.

[0] http://git.kernel.org/cgit/utils/kernel/kmod/kmod.git/commit/libkmod/libkmod-module.c?id=fd44a98ae2eb5eb32161088954ab21e58e19dfc4

Suggested-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 20:05:43 -04:00
Christoph Hellwig
a4058c5bce nfsd: remove nfsd_vfs_read
Simpler done in the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 17:49:24 -04:00
Christoph Hellwig
73da852e38 nfsd: use vfs_iter_read/write
Instead of messing with the address limit to use vfs_read/vfs_writev.

Note that this requires that exported file implement ->read_iter and
->write_iter.  All currently exportable file systems do this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 17:49:24 -04:00
Christoph Hellwig
abbb65899a fs: implement vfs_iter_write using do_iter_write
De-dupliate some code and allow for passing the flags argument to
vfs_iter_write.  Additionally it now properly updates timestamps.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 17:49:23 -04:00
Christoph Hellwig
18e9710ee5 fs: implement vfs_iter_read using do_iter_read
De-dupliate some code and allow for passing the flags argument to
vfs_iter_read.  Additional it properly updates atime now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 17:49:23 -04:00
Christoph Hellwig
edab5fe38c fs: move more code into do_iter_read/do_iter_write
The checks for the permissions and can read / write flags are common
for the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 17:49:22 -04:00
Christoph Hellwig
19c735868d fs: remove __do_readv_writev
Split it into one helper each for reads vs writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 17:49:22 -04:00
Christoph Hellwig
26c87fb7d1 fs: remove do_compat_readv_writev
opencode it in both callers to simplify the call stack a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 17:49:21 -04:00
Christoph Hellwig
251b42a1dc fs: remove do_readv_writev
opencode it in both callers to simplify the call stack a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 17:49:21 -04:00
Linus Torvalds
374bf8831a Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Two fixes that should go into this release.

  One is an nvme regression fix from Keith, fixing a missing queue
  freeze if the controller is being reset. This causes the reset to
  hang.

  The other is a fix for a leak of the bio protection info, if smaller
  sized O_DIRECT is used. This fix should be more involved as we have
  other problematic paths in the kernel, but given as this isn't a
  regression in this series, we'll tackle those for 4.13"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: provide bio_uninit() free freeing integrity/task associations
  nvme/pci: Fix stuck nvme reset
2017-06-29 14:10:37 -07:00
Qu Wenruo
848c23b78f btrfs: Remove false alert when fiemap range is smaller than on-disk extent
Commit 4751832da9 ("btrfs: fiemap: Cache and merge fiemap extent before
submit it to user") introduced a warning to catch unemitted cached
fiemap extent.

However such warning doesn't take the following case into consideration:

0			4K			8K
|<---- fiemap range --->|
|<----------- On-disk extent ------------------>|

In this case, the whole 0~8K is cached, and since it's larger than
fiemap range, it break the fiemap extent emit loop.
This leaves the fiemap extent cached but not emitted, and caught by the
final fiemap extent sanity check, causing kernel warning.

This patch removes the kernel warning and renames the sanity check to
emit_last_fiemap_cache() since it's possible and valid to have cached
fiemap extent.

Reported-by: David Sterba <dsterba@suse.cz>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Fixes: 4751832da9 ("btrfs: fiemap: Cache and merge fiemap extent ...")
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:25:20 +02:00
Jan Kara
b7f8a09f80 btrfs: Don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of
__btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.

Fixes: 073931017b
CC: stable@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: David Sterba <dsterba@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:24:59 +02:00
Chris Mason
6374e57ad8 btrfs: fix integer overflow in calc_reclaim_items_nr
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
v4.12-rc6.  This was because commit 70e7af244 made it possible for
calc_reclaim_items_nr() to return a negative number.  It's not really a
bug in that commit, it just didn't go far enough down the stack to find
all the possible 64->32 bit overflows.

This switches calc_reclaim_items_nr() to return a u64 and changes everyone
that uses the results of that math to u64 as well.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
David Sterba
ded56184a5 btrfs: scrub: fix target device intialization while setting up scrub context
The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a
helper but wrongly sets up the target device. Incidentally there's a
local variable with the same name as a parameter in the previous
function, so this got caught during runtime as crash in test btrfs/027.

Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
bc42bda223 btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)

         Task A                  |             Task B
----------------------------------------------------------------------
Buffered_write [0, 2K)           |
|- check_data_free_space()       |
|  |- qgroup_reserve_data()      |
|     Range aligned to page      |
|     range [0, 4K)          <<< |
|     4K bytes reserved      <<< |
|- copy pages to page cache      |
                                 | Buffered_write [2K, 4K)
                                 | |- check_data_free_space()
                                 | |  |- qgroup_reserved_data()
                                 | |     Range alinged to page
                                 | |     range [0, 4K)
                                 | |     Already reserved by A <<<
                                 | |     0 bytes reserved      <<<
                                 | |- delalloc_reserve_metadata()
                                 | |  And it *FAILED* (Maybe EQUOTA)
                                 | |- free_reserved_data_space()
                                      |- qgroup_free_data()
                                         Range aligned to page range
                                         [0, 4K)
                                         Freeing 4K
(Special thanks to Chandan for the detailed report and analyse)

[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.

And at writeback time, page dirty by Task A will go through writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.

[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in above case, Task B will try to free 0 byte, so no underflow.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
364ecf3651 btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
Introduce a new parameter, struct extent_changeset for
btrfs_qgroup_reserved_data() and its callers.

Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
which range it reserved in current reserve, so it can free it in error
paths.

The reason we need to export it to callers is, at buffered write error
path, without knowing what exactly which range we reserved in current
allocation, we can free space which is not reserved by us.

This will lead to qgroup reserved space underflow.

Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
a12b877b55 btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
[BUG]
Under the following case, we can underflow qgroup reserved space.

            Task A                |            Task B
---------------------------------------------------------------
 Quota disabled                   |
 Buffered write                   |
 |- btrfs_check_data_free_space() |
 |  *NO* qgroup space is reserved |
 |  since quota is *DISABLED*     |
 |- All pages are copied to page  |
    cache                         |
                                  | Enable quota
                                  | Quota scan finished
                                  |
                                  | Sync_fs
                                  | |- run_delalloc_range
                                  | |- Write pages
                                  | |- btrfs_finish_ordered_io
                                  |    |- insert_reserved_file_extent
                                  |       |- btrfs_qgroup_release_data()
                                  |          Since no qgroup space is
                                             reserved in Task A, we
                                             underflow qgroup reserved
                                             space
This can be detected by fstest btrfs/104.

[CAUSE]
In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes
size of qgroup reserved_space in all cases.
And btrfs_qgroup_release_data() will check if quotas are enabled.

However in the above case, the buffered write happens before quota is
enabled, so we don't have the reserved space for that range.

[FIX]
In insert_reserved_file_extent(), we tell qgroup to release the acctual
byte number it released.
In the above case, since we don't have the reserved space, we tell
qgroups to release 0 byte, so the problem can be fixed.

And thanks to the @reserved parameter introduced by the qgroup rework,
and previous patch to return released bytes, the fix can be as small as
10 lines.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ changelog updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
7bc329c183 btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs_qgroup_release/free_data() only returns 0 or a negative error
number (ENOMEM is the only possible error).

This is normally good enough, but sometimes we need the exact byte
count it freed/released.

Change it to return actually released/freed bytenr number instead of 0
for success.
And slightly modify related extent_changeset structure, since in btrfs
one no-hole data extent won't be larger than 128M, so "unsigned int"
is large enough for the use case.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00