Commit Graph

5689 Commits

Author SHA1 Message Date
Bart Van Assche
9a8ac3ae68 dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing assign_bit()
No functional change but makes the code easier to read.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:08:46 -04:00
Bart Van Assche
ca5beb76c3 dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH
Instead of checking MPATHF_QUEUE_IF_NO_PATH,
MPATHF_SAVED_QUEUE_IF_NO_PATH and the no_flush flag to decide whether
or not to push back a request (or bio) if there are no paths available,
only clear MPATHF_QUEUE_IF_NO_PATH in queue_if_no_path() if no_flush has
not been set.  The result is that only a single bit has to be tested in
the hot path to decide whether or not a request must be pushed back and
also that m->lock does not have to be taken in the hot path.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:08:45 -04:00
Bart Van Assche
7e0d574f26 dm: introduce enum dm_queue_mode to cleanup related code
Introduce an enumeration type for the queue mode.  This patch does
not change any functionality but makes the DM code easier to read.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:08:44 -04:00
Bart Van Assche
b194679fac dm mpath: verify __pg_init_all_paths locking assumptions at runtime
Verify at runtime that __pg_init_all_paths() is called with
multipath.lock held if lockdep is enabled.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:08:43 -04:00
Bart Van Assche
1ea0654e46 dm: verify suspend_locking assumptions at runtime
Ensure that the assumptions about the caller holding suspend_lock
are checked at runtime if lockdep is enabled.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:08:42 -04:00
Bart Van Assche
73cbca6a63 dm block manager: remove an unused argument from dm_block_manager_create()
The 'cache_size' argument of dm_block_manager_create() has never been
used.  Remove it along with the definitions of the constants passed as
the 'cache_size' argument.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:08:41 -04:00
Bart Van Assche
23a6012489 dm rq: check blk_mq_register_dev() return value in dm_mq_init_request_queue()
Otherwise the request-based DM blk-mq request_queue will be put into
service without being properly exported via sysfs.

Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:08:40 -04:00
Bart Van Assche
c1d7ecf7ca dm mpath: delay requeuing while path initialization is in progress
Requeuing a request immediately while path initialization is ongoing
causes high CPU usage, something that is undesired.  Hence delay
requeuing while path initialization is in progress.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:08:01 -04:00
Bart Van Assche
7083abbbfc dm mpath: avoid that path removal can trigger an infinite loop
If blk_get_request() fails, check whether the failure is due to a path
being removed.  If that is the case, fail the path by triggering a call
to fail_path().  This avoids that the following scenario can be
encountered while removing paths:
* CPU usage of a kworker thread jumps to 100%.
* Removing the DM device becomes impossible.

Delay requeueing if blk_get_request() returns -EBUSY or -EWOULDBLOCK,
and the queue is not dying, because in these cases immediate requeuing
is inappropriate.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:04:27 -04:00
Xiao Ni
43ac9b84a3 md/raid1: Use a new variable to count flighting sync requests
In new barrier codes, raise_barrier waits if conf->nr_pending[idx] is not zero.
After all the conditions are true, the resync request can go on be handled. But
it adds conf->nr_pending[idx] again. The next resync request hit the same bucket
idx need to wait the resync request which is submitted before. The performance
of resync/recovery is degraded.
So we should use a new variable to count sync requests which are in flight.

I did a simple test:
1. Without the patch, create a raid1 with two disks. The resync speed:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00  166.00    0.00    10.38     0.00   128.00     0.03    0.20    0.20    0.00   0.19   3.20
sdc               0.00     0.00    0.00  166.00     0.00    10.38   128.00     0.96    5.77    0.00    5.77   5.75  95.50
2. With the patch, the result is:
sdb            2214.00     0.00  766.00    0.00   185.69     0.00   496.46     2.80    3.66    3.66    0.00   1.03  79.10
sdc               0.00  2205.00    0.00  769.00     0.00   186.44   496.52     5.25    6.84    0.00    6.84   1.30 100.10

Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-27 14:01:16 -07:00
Bart Van Assche
89bfce763e dm mpath: split and rename activate_path() to prepare for its expanded use
activate_path() is renamed to activate_path_work() which now calls
activate_or_offline_path().  activate_or_offline_path() will be used
by the next commit.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 17:00:35 -04:00
Adrian Salido
4617f564c0 dm ioctl: prevent stack leak in dm ioctl call
When calling a dm ioctl that doesn't process any data
(IOCTL_FLAGS_NO_PARAMS), the contents of the data field in struct
dm_ioctl are left initialized.  Current code is incorrectly extending
the size of data copied back to user, causing the contents of kernel
stack to be leaked to user.  Fix by only copying contents before data
and allow the functions processing the ioctl to override.

Cc: stable@vger.kernel.org
Signed-off-by: Adrian Salido <salidoa@google.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 13:55:13 -04:00
Mikulas Patocka
84ff1bcc2e dm integrity: use previously calculated log2 of sectors_per_block
The log2 of sectors_per_block was already calculated, so we don't have
to use the ilog2 function.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 12:16:32 -04:00
Mikulas Patocka
6625d90325 dm integrity: use hex2bin instead of open-coded variant
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 12:10:16 -04:00
Andy Shevchenko
e944e03e33 dm crypt: replace custom implementation of hex2bin()
There is no need to have a duplication of the generic library, i.e. hex2bin().
Replace the open coded variant.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-27 12:08:31 -04:00
Dan Williams
d4b29fd78e block: remove block_device_operations ->direct_access()
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Dan Williams
817bf40265 dm: teach dm-targets to use a dax_device + dax_operations
Arrange for dm to lookup the dax services available from member devices.
Update the dax-capable targets, linear and stripe, to route dax
operations to the underlying device. Changes the target-internal
->direct_access() method to more closely align with the dax_operations
->direct_access() calling convention.

Cc: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:36 -07:00
Eric Biggers
86f917adea dm crypt: remove obsolete references to per-CPU state
dm-crypt used to use separate crypto transforms for each CPU, but this
is no longer the case.  To avoid confusion, fix up obsolete comments and
rename setup_essiv_cpu().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-25 16:12:03 -04:00
Guoqing Jiang
e5bc9c3c54 md: clear WantReplacement once disk is removed
We can clear 'WantReplacement' flag directly no
matter it's replacement existed or not since the
semantic is same as before.

Also since the disk is removed from array, then
it is straightforward to remove 'WantReplacement'
flag and the comments in raid10/5 can be removed
as well.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-25 09:36:29 -07:00
Gilad Ben-Yossef
d1ac3ff008 dm verity: switch to using asynchronous hash crypto API
Use of the synchronous digest API limits dm-verity to using pure
CPU based algorithm providers and rules out the use of off CPU
algorithm providers which are normally asynchronous by nature,
potentially freeing CPU cycles.

This can reduce performance per Watt in situations such as during
boot time when a lot of concurrent file accesses are made to the
protected volume.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Eric Biggers <ebiggers3@gmail.com>
CC: Ondrej Mosnáček <omosnacek+linux-crypto@gmail.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 15:37:04 -04:00
Tim Murray
a1b89132dc dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
Running dm-crypt with workqueues at the standard priority results in IO
competing for CPU time with standard user apps, which can lead to
pipeline bubbles and seriously degraded performance.  Move to using
WQ_HIGHPRI workqueues to protect against that.

Signed-off-by: Tim Murray <timmurray@google.com>
Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 15:32:07 -04:00
Ondrej Kozina
c82feeec9a dm crypt: rewrite (wipe) key in crypto layer using random data
The message "key wipe" used to wipe real key stored in crypto layer by
rewriting it with zeroes.  Since commit 28856a9 ("crypto: xts -
consolidate sanity check for keys") this no longer works in FIPS mode
for XTS.

While running in FIPS mode the crypto key part has to differ from the
tweak key.

Fixes: 28856a9 ("crypto: xts - consolidate sanity check for keys")
Cc: stable@vger.kernel.org
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 15:16:03 -04:00
Bart Van Assche
06eb061f48 dm mpath: requeue after a small delay if blk_get_request() fails
If blk_get_request() returns ENODEV then multipath_clone_and_map()
causes a request to be requeued immediately. This can cause a kworker
thread to spend 100% of the CPU time of a single core in
__blk_mq_run_hw_queue() and also can cause device removal to never
finish.

Avoid this by only requeuing after a delay if blk_get_request() fails.
Additionally, reduce the requeue delay.

Cc: stable@vger.kernel.org # 4.9+
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 15:06:19 -04:00
Somasundaram Krishnasamy
117aceb030 dm era: save spacemap metadata root after the pre-commit
When committing era metadata to disk, it doesn't always save the latest
spacemap metadata root in superblock. Due to this, metadata is getting
corrupted sometimes when reopening the device. The correct order of update
should be, pre-commit (shadows spacemap root), save the spacemap root
(newly shadowed block) to in-core superblock and then the final commit.

Cc: stable@vger.kernel.org
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 15:02:14 -04:00
Dennis Yang
948f581a53 dm thin: fix a memory leak when passing discard bio down
dm-thin does not free the discard_parent bio after all chained sub
bios finished. The following kmemleak report could be observed after
pool with discard_passdown option processes discard bios in
linux v4.11-rc7. To fix this, we drop the discard_parent bio reference
when its endio (passdown_endio) called.

unreferenced object 0xffff8803d6b29700 (size 256):
  comm "kworker/u8:0", pid 30349, jiffies 4379504020 (age 143002.776s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    01 00 00 00 00 00 00 f0 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81a5efd9>] kmemleak_alloc+0x49/0xa0
    [<ffffffff8114ec34>] kmem_cache_alloc+0xb4/0x100
    [<ffffffff8110eec0>] mempool_alloc_slab+0x10/0x20
    [<ffffffff8110efa5>] mempool_alloc+0x55/0x150
    [<ffffffff81374939>] bio_alloc_bioset+0xb9/0x260
    [<ffffffffa018fd20>] process_prepared_discard_passdown_pt1+0x40/0x1c0 [dm_thin_pool]
    [<ffffffffa018b409>] break_up_discard_bio+0x1a9/0x200 [dm_thin_pool]
    [<ffffffffa018b484>] process_discard_cell_passdown+0x24/0x40 [dm_thin_pool]
    [<ffffffffa018b24d>] process_discard_bio+0xdd/0xf0 [dm_thin_pool]
    [<ffffffffa018ecf6>] do_worker+0xa76/0xd50 [dm_thin_pool]
    [<ffffffff81086239>] process_one_work+0x139/0x370
    [<ffffffff810867b1>] worker_thread+0x61/0x450
    [<ffffffff8108b316>] kthread+0xd6/0xf0
    [<ffffffff81a6cd1f>] ret_from_fork+0x3f/0x70
    [<ffffffffffffffff>] 0xffffffffffffffff

Cc: stable@vger.kernel.org
Signed-off-by: Dennis Yang <dennisyang@qnap.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 14:58:10 -04:00
Vinothkumar Raja
7d1fedb6e9 dm btree: fix for dm_btree_find_lowest_key()
dm_btree_find_lowest_key() is giving incorrect results.  find_key()
traverses the btree correctly for finding the highest key, but there is
an error in the way it traverses the btree for retrieving the lowest
key.  dm_btree_find_lowest_key() fetches the first key of the rightmost
block of the btree instead of fetching the first key from the leftmost
block.

Fix this by conditionally passing the correct parameter to value64()
based on the @find_highest flag.

Cc: stable@vger.kernel.org
Signed-off-by: Erez Zadok <ezk@fsl.cs.sunysb.edu>
Signed-off-by: Vinothkumar Raja <vinraja@cs.stonybrook.edu>
Signed-off-by: Nidhi Panpalia <npanpalia@cs.stonybrook.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 14:47:49 -04:00
Matthias Kaehlcke
e36215d87f dm ioctl: remove double parentheses
The extra pair of parantheses is not needed and causes clang to generate
warnings about the DM_DEV_CREATE_CMD comparison in validate_params().

Also remove another double parentheses that doesn't cause a warning.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 14:31:53 -04:00
Mikulas Patocka
9119fedddb dm: remove dummy dm_table definition
This dummy structure definition was required for RCU macros, but it
isn't required anymore, so delete it.

The dummy definition confuses the crash tool, see:
https://www.redhat.com/archives/dm-devel/2017-April/msg00197.html

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 12:04:35 -04:00
Mikulas Patocka
583fe7474c dm crypt: fix large block integrity support
Previously, dm-crypt could use blocks composed of multiple 512b sectors
but it created integrity profile for each 512b sector (it padded it with
zeroes).  Fix dm-crypt so that the integrity profile is sent for each
block not each sector.

The user must use the same block size in the DM crypt and integrity
targets.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 12:04:34 -04:00
Mikulas Patocka
9d609f85b7 dm integrity: support larger block sizes
The DM integrity block size can now be 512, 1k, 2k or 4k.  Using larger
blocks reduces metadata handling overhead.  The block size can be
configured at table load time using the "block_size:<value>" option;
where <value> is expressed in bytes (defult is still 512 bytes).

It is safe to use larger block sizes with DM integrity, because the
DM integrity journal makes sure that the whole block is updated
atomically even if the underlying device doesn't support atomic writes
of that size (e.g. 4k block ontop of a 512b device).

Depends-on: 2859323e ("block: fix blk_integrity_register to use template's interval_exp if not 0")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 12:04:33 -04:00
Mikulas Patocka
56b67a4f29 dm integrity: various small changes and cleanups
Some coding style changes.

Fix a bug that the array test_tag has insufficient size if the digest
size of internal has is bigger than the tag size.

The function __fls is undefined for zero argument, this patch fixes
undefined behavior if the user sets zero interleave_sectors.

Fix the limit of optional arguments to 8.

Don't allocate crypt_data on the stack to avoid a BUG with debug kernel.

Rename all optional argument names to have underscores rather than
dashes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 12:04:32 -04:00
Mikulas Patocka
e2460f2a4b dm: mark targets that pass integrity data
A dm-crypt on dm-integrity device incorrectly advertises an integrity
profile on the DM crypt device.  It can be seen in the files
"/sys/block/dm-*/integrity/*" that both dm-integrity and dm-crypt target
advertise the integrity profile.  That is incorrect, only the
dm-integrity target should advertise the integrity profile.

A general problem in DM is that if we have a DM device that depends on
another device with an integrity profile, the upper device will always
advertise the integrity profile, even when the target driver doesn't
support handling integrity data.

Most targets don't support integrity data, so we provide a whitelist of
targets that support it (linear, delay and striped).  The targets that
support passing integrity data to the lower device are marked with the
flag DM_TARGET_PASSES_INTEGRITY.  The DM core will now advertise
integrity data on a DM device only if all the targets support the
integrity data.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 12:04:32 -04:00
Mikulas Patocka
3c12016910 dm table: replace while loops with for loops
Also remove some unnecessary use of uninitialized_var().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-24 12:04:31 -04:00
Lidong Zhong
296617581e md/raid1/10: remove unused queue
A queue is declared and get from the disk of the array, but it's not
used anywhere. So removing it from the source.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Acted-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-23 16:59:13 -07:00
NeilBrown
97b20ef784 md: handle read-only member devices better.
1/ If an array has any read-only devices when it is started,
   the array itself must be read-only
2/ A read-only device cannot be added to an array after it is
   started.
3/ Setting an array to read-write should not succeed
   if any member devices are read-only

Reported-and-Tested-by: Nanda Kishore Chinnaram <Nanda_Kishore_Chinna@dell.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-20 13:25:51 -07:00
Dan Williams
f26c5719b2 dm: add dax_device and dax_operations support
Allocate a dax_device to represent the capacity of a device-mapper
instance. Provide a ->direct_access() method via the new dax_operations
indirection that mirrors the functionality of the current direct_access
support via block_device_operations.  Once fs/dax.c has been converted
to use dax_operations the old dm_blk_direct_access() will be removed.

A new helper dm_dax_get_live_target() is introduced to separate some of
the dm-specifics from the direct_access implementation.

This enabling is only for the top-level dm representation to upper
layers. Converting target direct_access implementations is deferred to a
separate patch.

Cc: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-20 11:57:52 -07:00
Christoph Hellwig
08e0029aa2 blk-mq: remove the error argument to blk_mq_complete_request
Now that all drivers that call blk_mq_complete_requests have a
->complete callback we can remove the direct call to blk_mq_end_request,
as well as the error argument to blk_mq_complete_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Christoph Hellwig
8fc7798058 dm mpath: don't check for req->errors
We'll get all proper errors reported through ->end_io and ->errors will
go away soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Christoph Hellwig
e0af413a45 dm rq: don't pass irrelevant error code to blk_mq_complete_request
dm never uses rq->errors, so there is no need to pass an error argument
to blk_mq_complete_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Guoqing Jiang
cf25ae78fc md/raid10: wait up frozen array in handle_write_completed
Since nr_queued is changed, we need to call wake_up here
if the array is already frozen and waiting for condition
"nr_pending == nr_queued + extra" to be true.

And commit 824e47dadd ("RAID1: avoid unnecessary spin
locks in I/O barrier code") which has already added the
wake_up for raid1.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-20 09:55:52 -07:00
Christophe JAILLET
835d89e92f md-cluster: Fix a memleak in an error handling path
We know that 'bm_lockres' is NULL here, so 'lockres_free(bm_lockres)' is a
no-op. According to resource handling in case of error a few lines below,
it is likely that 'bitmap_free(bitmap)' was expected instead.

Fixes: b98938d16a ("md-cluster: introduce cluster_check_sync_size")

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-14 08:08:29 -07:00
NeilBrown
78b6350dca md: support disabling of create-on-open semantics.
md allows a new array device to be created by simply
opening a device file.  This make it difficult to
remove the device and udev is likely to open the device file
as part of processing the REMOVE event.

There is an alternate mechanism for creating arrays
by writing to the new_array module parameter.
When using tools that work with this parameter, it is
best to disable the old semantics.
This new module parameter allows that.

Signed-off-by: NeilBrown <neilb@suse.com>
Acted-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-12 12:30:17 -07:00
NeilBrown
039b7225e6 md: allow creation of mdNNN arrays via md_mod/parameters/new_array
The intention when creating the "new_array" parameter and the
possibility of having array names line "md_HOME" was to transition
away from the old way of creating arrays and to eventually only use
this new way.

The "old" way of creating array is to create a device node in /dev
and then open it.  The act of opening creates the array.
This is problematic because sometimes the device node can be opened
when we don't want to create an array.  This can easily happen
when some rule triggered by udev looks at a device as it is being
destroyed.  The node in /dev continues to exist for a short period
after an array is stopped, and opening it during this time recreates
the array (as an inactive array).

Unfortunately no clear plan for the transition was created.  It is now
time to fix that.

This patch allows devices with numeric names, like "md999" to be
created by writing to "new_array".  This will only work if the minor
number given is not already in use.  This will allow mdadm to
support the creation of arrays with numbers > 511 (currently not
possible) by writing to new_array.
mdadm can, at some point, use this approach to create *all* arrays,
which will allow the transition to only using the new-way.

Signed-off-by: NeilBrown <neilb@suse.com>
Acted-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-12 12:30:11 -07:00
Artur Paszkiewicz
fcd403aff6 raid5-ppl: use a single mempool for ppl_io_unit and header_page
Allocate both struct ppl_io_unit and its header_page from a shared
mempool to avoid a possible deadlock. Implement allocate and free
functions for the mempool, remove the second pool for allocating
header_page. The header_pages are now freed with their io_units, not
when the ppl bio completes. Also, use GFP_NOWAIT instead of GFP_ATOMIC
for allocating ppl_io_unit because we can handle failed allocations and
there is no reason to utilize emergency reserves.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11 14:56:46 -07:00
NeilBrown
f00d7c85be md/raid0: fix up bio splitting.
raid0_make_request() should use a private bio_set rather than the
shared fs_bio_set, which is only meant for filesystems to use.

raid0_make_request() shouldn't loop around using the bio_set
multiple times as that can deadlock.

So use mddev->bio_set and pass the tail to generic_make_request()
instead of looping on it.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11 10:18:09 -07:00
NeilBrown
868f604b1d md/linear: improve bio splitting.
linear_make_request() uses fs_bio_set, which is meant for filesystems
to use, and loops, possible allocating  from the same bio set multiple
times.
These behaviors can theoretically cause deadlocks, though as
linear requests are hardly ever split, it is unlikely in practice.

Change to use mddev->bio_set - otherwise unused for linear, and submit
the tail of a split request to generic_make_request() for it to
handle.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11 10:17:55 -07:00
NeilBrown
dd7a8f5dee md/raid5: make chunk_aligned_read() split bios more cleanly.
chunk_aligned_read() currently uses fs_bio_set - which is meant for
filesystems to use - and loops if multiple splits are needed, which is
not best practice.
As this is only used for READ requests, not writes, it is unlikely
to cause a problem.  However it is best to be consistent in how
we split bios, and to follow the pattern used in raid1/raid10.

So create a private bioset, bio_split, and use it to perform a single
split, submitting the remainder to generic_make_request() for later
processing.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11 10:16:50 -07:00
NeilBrown
545250f248 md/raid10: simplify handle_read_error()
handle_read_error() duplicates a lot of the work that raid10_read_request()
does, so it makes sense to just use that function.

handle_read_error() relies on the same r10bio being re-used so that,
in the case of a read-only array, setting IO_BLOCKED in r1bio->devs[].bio
ensures read_balance() won't re-use that device.
So when called from raid10_make_request() we clear that array, but not
when called from handle_read_error().

Two parts of handle_read_error() that need to be preserved are the warning
message it prints, so they are conditionally added to
raid10_read_request().  If the failing rdev can be found, messages
are printed.  Otherwise they aren't.

Not that as rdev_dec_pending() has already been called on the failing
rdev, we need to use rcu_read_lock() to get a new reference from
the conf.  We only use this to get the name of the failing block device.

With this change, we no longer need inc_pending().

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11 10:15:08 -07:00
NeilBrown
fc9977dd06 md/raid10: simplify the splitting of requests.
raid10 splits requests in two different ways for two different
reasons.

First, bio_split() is used to ensure the bio fits with a chunk.
Second, multiple r10bio structures are allocated to represent the
different sections that need to go to different devices, to avoid
known bad blocks.

This can be simplified to just use bio_split() once, and not to use
multiple r10bios.
We delay the split until we know a maximum bio size that can
be handled with a single r10bio, and then split the bio and queue
the remainder for later handling.

As with raid1, we allocate a new bio_set to help with the splitting.
It is not correct to use fs_bio_set in a device driver.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11 10:13:02 -07:00
NeilBrown
673ca68d93 md/raid1: factor out flush_bio_list()
flush_pending_writes() and raid1_unplug() each contain identical
copies of a fairly large slab of code.  So factor that out into
new flush_bio_list() to simplify maintenance.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11 10:12:36 -07:00