Commit Graph

5319 Commits

Author SHA1 Message Date
Guoqing Jiang
b3c95b425e md/raid1: remove unnecessary BUG_ON
bitmap_start_sync will not return until
sync_blocks is at least PAGE_SIZE>>9, so
the BUG_ON is no longer needed.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-14 11:35:58 -07:00
Ming Lei
fafcde3ac1 md: multipath: don't hardcopy bio in .make_request path
Inside multipath_make_request(), multipath maps the incoming
bio into low level device's bio, but it is totally wrong to
copy the bio into mapped bio via '*mapped_bio = *bio'. For
example, .__bi_remaining is kept in the copy, so if other bios have
been chained to the incoming bio via bio splitting, .bi_end_io
can't be called for the mapped bio at all in the completion path.

This patch fixes the issue by using clone style.

Cc: stable@vger.kernel.org (v3.14+)
Reported-and-tested-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-14 11:32:26 -07:00
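The hazard described in fafcde3ac1 can be illustrated with a small userspace sketch. This is not the kernel code: `struct toy_bio` and the `toy_*` helpers are hypothetical stand-ins, and only the idea (a whole-struct copy drags the completion counter along, while a clone starts with fresh completion state) is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct bio; all names here are illustrative only. */
struct toy_bio {
    int remaining;                     /* outstanding completions, cf. __bi_remaining */
    void (*end_io)(struct toy_bio *);  /* completion callback */
    unsigned long sector;              /* the I/O description we do want to carry over */
};

/* Hardcopy style: '*dst = *src' also copies the completion counter. */
static void toy_hardcopy(struct toy_bio *dst, const struct toy_bio *src)
{
    *dst = *src;
}

/* Clone style: copy the I/O description, start completion state fresh. */
static void toy_clone(struct toy_bio *dst, const struct toy_bio *src)
{
    dst->sector = src->sector;
    dst->remaining = 1;
    dst->end_io = NULL;
}

/* Returns 1 if the end_io callback would run now, 0 if it is held back. */
static int toy_complete(struct toy_bio *b)
{
    assert(b != NULL);
    return --b->remaining == 0;
}
```

With a parent whose counter is 2 because another bio is chained to it, completing the hardcopied bio leaves its counter at 1, so its callback never fires; the cloned bio completes on the first call.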
Mike Snitzer
c3667cc619 dm thin: consistently return -ENOSPC if pool has run out of data space
Commit 0a927c2f02 ("dm thin: return -ENOSPC when erroring retry list due
to out of data space") was a step in the right direction but didn't go
far enough.

Add a new 'out_of_data_space' flag to 'struct pool' and set it if/when
the pool runs out of data space.  This fixes cell_error() and
error_retry_list() to not blindly return -EIO.

We cannot rely on the 'error_if_no_space' feature flag since it is
transient (in that it can be reset once space is added, plus it only
controls whether errors are issued, it doesn't reflect whether the
pool is actually out of space).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-11 16:15:22 -05:00
Mike Snitzer
843f0f2e8f dm cache: bump the target version
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:12 -05:00
Joe Thornber
d14fcf3dd7 dm cache: make sure every metadata function checks fail_io
Otherwise operations may be attempted that will only ever go on to crash
(since the metadata device is either missing or unreliable if 'fail_io'
is set).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2016-03-10 17:12:12 -05:00
Mike Snitzer
3f0680402c dm: add missing newline between DM_DEBUG_BLOCK_STACK_TRACING and DM_BUFIO
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:11 -05:00
Mike Snitzer
7dd85bb0e9 dm cache policy smq: clarify that mq registration failure was for 'mq'
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:11 -05:00
Mike Snitzer
c80914e81e dm: return error if bio_integrity_clone() fails in clone_bio()
clone_bio() now checks if bio_integrity_clone() returned an error rather
than just drop it on the floor.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:10 -05:00
Joe Thornber
2eae9e4489 dm thin metadata: don't issue prefetches if a transaction abort has failed
If a transaction abort has failed then we can no longer use the metadata
device.  Typically this happens if the superblock is unreadable.

This fix addresses a crash seen during metadata device failure testing.

Fixes: 8a01a6af75 ("dm thin: prefetch missing metadata pages")
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:09 -05:00
DingXiang
4df2bf466a dm snapshot: disallow the COW and origin devices from being identical
Otherwise loading a "snapshot" table using the same device for the
origin and COW devices, e.g.:

echo "0 20971520 snapshot 253:3 253:3 P 8" | dmsetup create snap

will trigger:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[ 1958.979934] IP: [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1958.989655] PGD 0
[ 1958.991903] Oops: 0000 [#1] SMP
...
[ 1959.059647] CPU: 9 PID: 3556 Comm: dmsetup Tainted: G          IO    4.5.0-rc5.snitm+ #150
...
[ 1959.083517] task: ffff8800b9660c80 ti: ffff88032a954000 task.ti: ffff88032a954000
[ 1959.091865] RIP: 0010:[<ffffffffa040efba>]  [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1959.104295] RSP: 0018:ffff88032a957b30  EFLAGS: 00010246
[ 1959.110219] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001
[ 1959.118180] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff880329334a00
[ 1959.126141] RBP: ffff88032a957b50 R08: 0000000000000000 R09: 0000000000000001
[ 1959.134102] R10: 000000000000000a R11: f000000000000000 R12: ffff880330884d80
[ 1959.142061] R13: 0000000000000008 R14: ffffc90001c13088 R15: ffff880330884d80
[ 1959.150021] FS:  00007f8926ba3840(0000) GS:ffff880333440000(0000) knlGS:0000000000000000
[ 1959.159047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1959.165456] CR2: 0000000000000098 CR3: 000000032f48b000 CR4: 00000000000006e0
[ 1959.173415] Stack:
[ 1959.175656]  ffffc90001c13040 ffff880329334a00 ffff880330884ed0 ffff88032a957bdc
[ 1959.183946]  ffff88032a957bb8 ffffffffa040f225 ffff880329334a30 ffff880300000000
[ 1959.192233]  ffffffffa04133e0 ffff880329334b30 0000000830884d58 00000000569c58cf
[ 1959.200521] Call Trace:
[ 1959.203248]  [<ffffffffa040f225>] dm_exception_store_create+0x1d5/0x240 [dm_snapshot]
[ 1959.211986]  [<ffffffffa040d310>] snapshot_ctr+0x140/0x630 [dm_snapshot]
[ 1959.219469]  [<ffffffffa0005c44>] ? dm_split_args+0x64/0x150 [dm_mod]
[ 1959.226656]  [<ffffffffa0005ea7>] dm_table_add_target+0x177/0x440 [dm_mod]
[ 1959.234328]  [<ffffffffa0009203>] table_load+0x143/0x370 [dm_mod]
[ 1959.241129]  [<ffffffffa00090c0>] ? retrieve_status+0x1b0/0x1b0 [dm_mod]
[ 1959.248607]  [<ffffffffa0009e35>] ctl_ioctl+0x255/0x4d0 [dm_mod]
[ 1959.255307]  [<ffffffff813304e2>] ? memzero_explicit+0x12/0x20
[ 1959.261816]  [<ffffffffa000a0c3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 1959.268615]  [<ffffffff81215eb6>] do_vfs_ioctl+0xa6/0x5c0
[ 1959.274637]  [<ffffffff81120d2f>] ? __audit_syscall_entry+0xaf/0x100
[ 1959.281726]  [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
[ 1959.288814]  [<ffffffff81216449>] SyS_ioctl+0x79/0x90
[ 1959.294450]  [<ffffffff8167e4ae>] entry_SYSCALL_64_fastpath+0x12/0x71
...
[ 1959.323277] RIP  [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1959.333090]  RSP <ffff88032a957b30>
[ 1959.336978] CR2: 0000000000000098
[ 1959.344121] ---[ end trace b049991ccad1169e ]---

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1195899
Cc: stable@vger.kernel.org
Signed-off-by: Ding Xiang <dingxiang@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:09 -05:00
Joe Thornber
9ed84698fd dm cache: make the 'mq' policy an alias for 'smq'
smq seems to be performing better than the old mq policy in all
situations, as well as using a quarter of the memory.

Make 'mq' an alias for 'smq' when choosing a cache policy.  The tunables
that were present for the old mq are faked, and have no effect.  mq
should be considered deprecated now.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:08 -05:00
Bob Liu
e233d800a9 dm: drop unnecessary assignment of md->queue
md->queue and q are the same thing in dm_old_init_request_queue() and
dm_mq_init_request_queue().

Also drop the temporary 'struct request_queue *q' in
dm_old_init_request_queue().

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:07 -05:00
Mike Snitzer
032482fda4 dm: reorder 'struct mapped_device' members to fix alignment and holes
Saves 16 bytes by eliminating four 4-byte holes, but more importantly:
numerous members that crossed cachelines were fixed.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:07 -05:00
Mike Snitzer
1d3aa6f683 dm: remove dummy definition of 'struct dm_table'
Change the map pointer in 'struct mapped_device' from 'struct dm_table
__rcu *' to 'void __rcu *' to avoid the need for the dummy definition.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:06 -05:00
Mike Snitzer
115485e83f dm: add 'dm_numa_node' module parameter
Allows user to control which NUMA node the memory for DM device
structures (e.g. mapped_device, request_queue, gendisk, blk_mq_tag_set)
is allocated from.

Defaults to NUMA_NO_NODE (-1).  Allowable range is from -1 until the
last online NUMA node id.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:06 -05:00
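The allowable range stated in 115485e83f can be written as a simple predicate. This is only a sketch: `toy_numa_node_valid()` and `TOY_NUMA_NO_NODE` are hypothetical, and how the real module parameter treats out-of-range values is not stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUMA_NO_NODE (-1)   /* mirrors the documented default of -1 */

/* True if 'node' lies in the documented range [-1, last_online_node]. */
static bool toy_numa_node_valid(int node, int last_online_node)
{
    assert(last_online_node >= 0);
    return node >= TOY_NUMA_NO_NODE && node <= last_online_node;
}
```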
Mike Snitzer
29f929b52d dm thin metadata: remove needless newline from subtree_dec() DMERR message
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:05 -05:00
Mike Snitzer
ec31f3f78a dm mpath: cleanup reinstate_path() et al based on code review
fail_path() will print a "Failing path ..." message but reinstate_path()
doesn't print a "Reinstating path ...".  Add that message to
reinstate_path() to add symmetry and aid system debugging.

Remove reinstate_path()'s check for the path_selector providing
.reinstate_path hook.  All path selectors provide this and any future
ones must too.

activate_path() calls pg_init_done() with SCSI_DH_DEV_OFFLINED but
pg_init_done() doesn't explicitly handle it in its switch statement.  Add
SCSI_DH_DEV_OFFLINED to the default case.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-10 17:12:04 -05:00
Shaohua Li
fb3229d5cd md/raid5: output stripe state for debug
Neil recently fixed an obscure race in break_stripe_batch_list. Debugging would
have been much easier had the stripe state been known, so this patch outputs it.

Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-09 10:08:38 -08:00
NeilBrown
550da24f8d md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list
break_stripe_batch_list breaks up a batch and copies some flags from
the batch head to the members, preserving others.

It doesn't preserve or copy STRIPE_PREREAD_ACTIVE.  This is not
normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a
stripe_head is added to a batch, and is not set on stripe_heads
already in a batch.

However there is no locking to ensure one thread doesn't set the flag
after it has just been cleared in another.  This does occasionally happen.

md/raid5 maintains a count of the number of stripe_heads with
STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes.  When
break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently
this count becomes incorrect and will never again return to zero.

md/raid5 delays the handling of some stripe_heads until
preread_active_stripes becomes zero.  So when the above-mentioned race
happens, those stripe_heads become blocked and never progress,
resulting in writes to the array hanging.

So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE
in the members of a batch.

URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153
URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz
Reported-by: Martin Svec <martin.svec@zoner.cz> (and others)
Tested-by: Tom Weber <linux@junkyard.4t2.com>
Fixes: 1b956f7a8f ("md/raid5: be more selective about distributing flags across batch.")
Cc: stable@vger.kernel.org (v4.1 and later)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-09 09:31:41 -08:00
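The flag handling described in 550da24f8d amounts to two bit masks: bits a batch member keeps, and bits copied from the batch head. A hedged sketch, in which the `TOY_*` bit positions and `toy_break_batch_state()` are made up for illustration and do not match the kernel's definitions:

```c
#include <assert.h>

/* Illustrative bit positions only; not the kernel's values. */
#define TOY_STRIPE_HANDLE          (1UL << 0)
#define TOY_STRIPE_PREREAD_ACTIVE  (1UL << 1)
#define TOY_STRIPE_DEGRADED        (1UL << 2)

/*
 * New state of a batch member when the batch is broken up:
 * keep the member's own bits in 'preserve', take the head's bits in 'copy',
 * and drop everything else.
 */
static unsigned long toy_break_batch_state(unsigned long member, unsigned long head,
                                           unsigned long preserve, unsigned long copy)
{
    assert((preserve & copy) == 0);   /* a bit is either preserved or copied */
    return (member & preserve) | (head & copy);
}
```

If TOY_STRIPE_PREREAD_ACTIVE is in neither mask, a member that raced and had the bit set loses it silently, and a counter of stripes with that bit set is never decremented for it. Adding the bit to the preserve mask keeps flag and counter in step.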
Eric Wheeler
f8b11260a4 bcache: fix cache_set_flush() NULL pointer dereference on OOM
When bch_cache_set_alloc() fails to kzalloc the cache_set, the
asynchronous closure handling tries to dereference a cache_set that
hadn't yet been allocated inside of cache_set_flush() which is called
by __cache_set_unregister() during cleanup.  This appears to happen only
during an OOM condition on bcache_register.

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: stable@vger.kernel.org
2016-03-08 09:19:10 -07:00
Eric Wheeler
9b299728ed bcache: cleaned up error handling around register_cache()
Fix null pointer dereference by changing register_cache() to return an int
instead of being void.  This allows it to return -ENOMEM or -ENODEV and
enables upper layers to handle the OOM case without NULL pointer issues.

See this thread:
  http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3521

Fixes this error:
  gargamel:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register

  bcache: register_cache() error opening sdh2: cannot allocate memory
  BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8
  IP: [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache]
  PGD 120dff067 PUD 1119a3067 PMD 0
  Oops: 0000 [#1] SMP
  Modules linked in: veth ip6table_filter ip6_tables
  (...)
  CPU: 4 PID: 3371 Comm: kworker/4:3 Not tainted 4.4.2-amd64-i915-volpreempt-20160213bc1 #3
  Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
  Workqueue: events cache_set_flush [bcache]
  task: ffff88020d5dc280 ti: ffff88020b6f8000 task.ti: ffff88020b6f8000
  RIP: 0010:[<ffffffffc05a7e8d>]  [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache]

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Tested-by: Marc MERLIN <marc@merlins.org>
Cc: <stable@vger.kernel.org>
2016-03-08 09:19:08 -07:00
Eric Wheeler
07cc6ef8ed bcache: fix race of writeback thread starting before complete initialization
The bch_writeback_thread might BUG_ON in read_dirty() if
dc->sb==BDEV_STATE_DIRTY and bch_sectors_dirty_init has not yet completed
its related initialization.  This patch downs the dc->writeback_lock until
after initialization is complete, thus preventing bch_writeback_thread
from proceeding prematurely.

See this thread:
  http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3453

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Tested-by: Marc MERLIN <marc@merlins.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-08 09:17:30 -07:00
Eric Engestrom
c97e0602bc md/bitmap: remove redundant check
daemon_sleep is an unsigned, so testing if it's 0 or less than 1 does
the same thing.

Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-07 09:30:16 -08:00
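The redundancy noted in c97e0602bc follows from unsigned arithmetic: the only unsigned value below 1 is 0. A small sketch with hypothetical helper names (the exact expression in the kernel source is not quoted in the commit message):

```c
#include <assert.h>
#include <stdbool.h>

/* Both tests combined, as a redundant check would have it. */
static bool toy_sleep_invalid_both(unsigned long daemon_sleep)
{
    return daemon_sleep == 0 || daemon_sleep < 1;
}

/* A single test: for an unsigned value, '< 1' is true only for 0. */
static bool toy_sleep_invalid_single(unsigned long daemon_sleep)
{
    return daemon_sleep < 1;
}
```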
Shaohua Li
70d9798b95 MD: warn for potential deadlock
The personality thread shouldn't call mddev_suspend(), because
mddev_suspend() waits for all IO to finish, but IO is handled in the
personality thread, so this could cause a deadlock. To catch this early,
add a warning if mddev_suspend() is called from the personality thread.

Suggested-by: NeilBrown <neilb@suse.com>
Cc: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26 09:44:57 -08:00
Sebastian Parschauer
399146b80e md: Drop sending a change uevent when stopping
When an MD device is stopped, its device node /dev/mdX may still
exist afterwards or be recreated by udev. The next open() call
can lead to creation of an inoperable MD device. The reason for
this is that a change event (KOBJ_CHANGE) is sent to udev which
races against the remove event (KOBJ_REMOVE) from md_free().
So drop sending the change event.

A change is likely also required in mdadm as many versions send the
change event to udev as well.

Neil mentioned that the change event is a workaround for an old kernel
commit: 934d9c23b4 ("md: destroy partitions and notify udev when md array is stopped.").
New mdadm can handle device removal now, so this isn't required any more.

Cc: NeilBrown <neilb@suse.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26 09:44:56 -08:00
Shaohua Li
6ab2a4b806 RAID5: revert e9e4c377e2 to fix a livelock
Revert commit
e9e4c377e2f563(md/raid5: per hash value and exclusive wait_for_stripe)

The problem is raid5_get_active_stripe waits on
conf->wait_for_stripe[hash]. Assume hash is 0. My test releases stripes
in this order:
- release all stripes with hash 0
- raid5_get_active_stripe still sleeps since active_stripes >
  max_nr_stripes * 3 / 4
- release all stripes with hash other than 0. active_stripes becomes 0
- raid5_get_active_stripe still sleeps, since nobody wakes up
  wait_for_stripe[0]
The system livelocks. The problem is that active_stripes isn't a per-hash
count. Reverting the patch makes the livelock go away.

Cc: stable@vger.kernel.org (v4.2+)
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26 09:44:56 -08:00
Shaohua Li
27a353c026 RAID5: check_reshape() shouldn't call mddev_suspend
check_reshape() is called from the raid5d thread. The raid5d thread shouldn't
call mddev_suspend(), because mddev_suspend() waits for all IO to finish
but IO is handled in the raid5d thread, so we could easily deadlock here.

This issue is introduced by
738a273 ("md/raid5: fix allocation of 'scribble' array.")

Cc: stable@vger.kernel.org (v4.1+)
Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26 09:44:11 -08:00
Jes Sorensen
e7597e69de md/raid5: Compare apples to apples (or sectors to sectors)
'max_discard_sectors' is in sectors, while 'stripe' is in bytes.

This fixes the problem where DISCARD would get disabled on some larger
RAID5 configurations (6 or more drives in my testing), while it worked
as expected with smaller configurations.

Fixes: 620125f2bf ("MD: raid5 trim support")
Cc: stable@vger.kernel.org v3.7+
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-25 16:38:53 -08:00
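The unit mismatch fixed in e7597e69de can be sketched as follows. The helper names, the direction of the comparison, and the sample sizes are assumptions for illustration; the commit message only states that 'max_discard_sectors' is in sectors and 'stripe' is in bytes.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SECTOR_SHIFT 9   /* 512-byte sectors */

/* Mixed units: a sector count compared against a byte count. */
static bool toy_discard_ok_mixed(unsigned int max_discard_sectors,
                                 unsigned int stripe_bytes)
{
    return max_discard_sectors >= stripe_bytes;
}

/* Same units: convert the stripe size to sectors before comparing. */
static bool toy_discard_ok_sectors(unsigned int max_discard_sectors,
                                   unsigned int stripe_bytes)
{
    return max_discard_sectors >= (stripe_bytes >> TOY_SECTOR_SHIFT);
}
```

For example, with a hypothetical limit of 8192 sectors (4 MiB) and a 2621440-byte stripe (5120 sectors), the mixed-unit test rejects DISCARD while the sector-to-sector test accepts it, which matches the symptom of wider arrays losing DISCARD support.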
Mike Snitzer
9f54cec553 dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busy
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:44 -05:00
Mike Snitzer
be7d31cca8 dm mpath: switch from 'unsigned' to 'bool' for flags where appropriate
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:43 -05:00
Mike Snitzer
b0b477c7e0 dm round robin: use percpu 'repeat_count' and 'current_path'
Now that dm-mpath core is lockless in the per-IO fast path it is
critical, for performance, to have the .select_path hook
(rr_select_path) also be as lockless as possible.

The new percpu members of 'struct selector' allow for lockless support
of 'repeat_count' governed repeat use of a previously selected path.  If
a path fails while it is 'current_path' the worst case is concurrent IO
might be mapped to the failed path until the .fail_path hook
(rr_fail_path) is called.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:42 -05:00
Mike Snitzer
90a4323ccf dm path selector: remove 'repeat_count' return from .select_path hook
If a path selector has any use for a repeat_count it should be handled
locally and not depend on the dm-mpath core to be concerned with it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:42 -05:00
Mike Snitzer
9659f81144 dm mpath: push path selector locking down to path selectors
Proper locking of the lists used by the path selectors should be handled
within the selectors (relying on dm-mpath.c code's use of the m->lock
spinlock was reckless).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:41 -05:00
Mike Snitzer
21136f89d7 dm mpath: remove repeat_count support from multipath core
Preparation for making __multipath_map() avoid taking the m->lock
spinlock -- in favor of using RCU locking.

repeat_count was primarily for bio-based DM multipath's benefit.  There
is really no need for it anymore now that DM multipath is request-based.
As such, repeat_count > 1 is no longer honored and a warning is
displayed if the user attempts to use a value > 1.  This is a temporary
change for the round-robin path-selector (as a later commit will restore
its support for repeat_count > 1).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:40 -05:00
Mike Snitzer
7943bd6dd3 dm mpath: remove unnecessary casts in front of ti->private
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:40 -05:00
Mike Snitzer
78ce23b518 dm mpath: use blk_mq_alloc_request() and blk_mq_free_request() directly
There isn't any need to support both old .request_fn and blk-mq paths
in the blk-mq specific portion of __multipath_map().  Call
blk_mq_alloc_request() directly rather than use blk_get_request().

Similarly, call blk_mq_free_request(), rather than blk_put_request(), in
multipath_release_clone().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:39 -05:00
Mike Snitzer
2eff1924e1 dm mpath: cleanup 'struct dm_mpath_io' management code
Refactor and rename existing interfaces to be more specific and
self-documenting.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:39 -05:00
Mike Snitzer
8637a6bf14 dm mpath: use blk-mq pdu for per-request 'struct dm_mpath_io'
Allow the multipath target to avoid making small allocations for each
'struct dm_mpath_io' that is needed for each request.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:38 -05:00
Mike Snitzer
591ddcfc4b dm: allow immutable request-based targets to use blk-mq pdu
This will allow DM multipath to use a portion of the blk-mq pdu space
for target data (e.g. struct dm_mpath_io).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:37 -05:00
Mike Snitzer
30187e1d48 dm: rename target's per_bio_data_size to per_io_data_size
Request-based DM will also make use of per_bio_data_size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:37 -05:00
Mike Snitzer
eca7ee6dc0 dm: distinguish old .request_fn (dm-old) vs dm-mq request-based DM
Rename various methods to have either a "dm_old" or "dm_mq" prefix.
Improve code comments to assist with understanding the duality of code
that handles both "dm_old" and "dm_mq" cases.

It is now much easier to quickly look at the code and _know_ that a given
method is either 1) "dm_old" only 2) "dm_mq" only 3) common to both.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:33 -05:00
Mike Snitzer
c5248f79f3 dm: remove support for stacking dm-mq on .request_fn device(s)
Remove all fiddly code that propped up this support for a blk-mq
request-queue on top of all .request_fn devices.

Testing has proven this niche request-based dm-mq mode to be buggy, when
testing fault tolerance with DM multipath, and there is no point trying
to preserve it.

Should help improve efficiency of pure dm-mq code and make code
maintenance less delicate.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:33:46 -05:00
Mike Snitzer
818c5f3bef dm: fix a couple locking issues with use of block interfaces
old_stop_queue() was checking blk_queue_stopped() without holding the
q->queue_lock.

dm_requeue_original_request() needed to check blk_queue_stopped(), with
q->queue_lock held, before calling blk_mq_kick_requeue_list().  And a
side-effect of that change is start_queue() must also call
blk_mq_kick_requeue_list().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:33:09 -05:00
Mike Snitzer
1c357a1e86 dm: allocate blk_mq_tag_set rather than embed in mapped_device
The blk_mq_tag_set is only needed for dm-mq support.  There is no point
wasting space in 'struct mapped_device' for non-dm-mq devices.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # check kzalloc return
2016-02-22 12:07:14 -05:00
Mike Snitzer
faad87df4b dm: add 'dm_mq_nr_hw_queues' and 'dm_mq_queue_depth' module params
Allow user to change these values via module params or sysfs.

'dm_mq_nr_hw_queues' defaults to 1 (max 32).

'dm_mq_queue_depth' defaults to 2048 (up from 64, which proved far too
small under moderate sized workloads -- the dm-multipath device would
continuously block waiting for tags (requests) to become available).
The maximum is BLK_MQ_MAX_DEPTH (currently 10240).

Keep in mind the total number of pre-allocated requests per
request-based dm-mq device is 'dm_mq_nr_hw_queues' * 'dm_mq_queue_depth'
(currently 2048).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 12:07:10 -05:00
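The pre-allocation arithmetic in faad87df4b can be checked with a short sketch. The `TOY_*` constants restate the defaults and maxima quoted in the commit message; `toy_total_requests()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

#define TOY_DEFAULT_NR_HW_QUEUES  1u
#define TOY_MAX_NR_HW_QUEUES      32u
#define TOY_DEFAULT_QUEUE_DEPTH   2048u
#define TOY_MAX_QUEUE_DEPTH       10240u   /* BLK_MQ_MAX_DEPTH at the time */

/* Requests pre-allocated per request-based dm-mq device. */
static unsigned int toy_total_requests(unsigned int nr_hw_queues,
                                       unsigned int queue_depth)
{
    assert(nr_hw_queues >= 1 && nr_hw_queues <= TOY_MAX_NR_HW_QUEUES);
    assert(queue_depth >= 1 && queue_depth <= TOY_MAX_QUEUE_DEPTH);
    return nr_hw_queues * queue_depth;
}
```

With the defaults this gives 1 * 2048 = 2048 requests per device; at both maxima it would be 32 * 10240 = 327680, which is why the product is worth keeping in mind when tuning.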
Mike Snitzer
c91852ff08 dm: optimize dm_request_fn()
DM multipath is the only request-based DM target -- which only supports
tables with a single target that is immutable.  Leverage this fact in
dm_request_fn().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:22 -05:00
Mike Snitzer
16f122661d dm: optimize dm_mq_queue_rq()
DM multipath is the only dm-mq target.  But that aside, request-based DM
only supports tables with a single target that is immutable.  Leverage
this fact in dm_mq_queue_rq() by using the 'immutable_target' stored in
the mapped_device when the table was made active.  This saves the need
to even take the read-side of the SRCU via dm_{get,put}_live_table.

If the active DM table does not have an immutable target (e.g. "error"
target was swapped in) then fallback to the slow-path where the target
is looked up from the live table.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:22 -05:00
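The fast-path/slow-path split described in 16f122661d can be sketched in userspace as below. All `toy_*` types and functions are hypothetical; the counter merely stands in for the cost of taking the read side of the SRCU via dm_{get,put}_live_table.

```c
#include <assert.h>
#include <stddef.h>

struct toy_target { const char *name; };

struct toy_table {
    struct toy_target *targets;
    size_t num_targets;
};

struct toy_md {
    struct toy_target *immutable_target;  /* cached when the table was made active, or NULL */
    struct toy_table *live_table;
    unsigned int slow_lookups;            /* counts trips through the slow path */
};

/* Return the target to queue a request to. */
static struct toy_target *toy_find_target(struct toy_md *md)
{
    assert(md != NULL);
    if (md->immutable_target)
        return md->immutable_target;      /* fast path: no live-table lookup */

    md->slow_lookups++;                   /* slow path: consult the live table */
    if (md->live_table && md->live_table->num_targets > 0)
        return &md->live_table->targets[0];
    return NULL;
}
```

When the cached pointer is cleared, for instance because an "error" target was swapped in, the lookup falls back to the live table, mirroring the fallback the commit describes.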
Mike Snitzer
f083b09b78 dm: set DM_TARGET_WILDCARD feature on "error" target
The DM_TARGET_WILDCARD feature indicates that the "error" target may
replace any target; even immutable targets.  This feature will be useful
to preserve the ability to replace the "multipath" target even once it
is formally converted over to having the DM_TARGET_IMMUTABLE feature.

Also, implicit in the DM_TARGET_WILDCARD feature flag being set is that
.map, .map_rq, .clone_and_map_rq and .release_clone_rq are all defined
in the target_type.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:21 -05:00
Mike Snitzer
e522c03905 dm: cleanup dm_any_congested()
The request-based DM support for checking queue congestion doesn't
require access to the live DM table.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:20 -05:00
Mike Snitzer
ae6ad75e5c dm: remove unused dm_get_rq_mapinfo()
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:20 -05:00