Commit Graph

6477 Commits

Author SHA1 Message Date
Linus Torvalds
78d91a75b4 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Here's a pull request for 4.11-rc, fixing a set of issues mostly
  centered around the new scheduling framework. These have been brewing
  for a while, but split up into what we absolutely need in 4.11, and
  what we can defer until 4.12. These are well tested, on both single
  queue and multiqueue setups, and with and without shared tags. They
  fix several hangs that have happened in testing.

  This is obviously larger than I would have preferred at this point in
  time, but I don't think we can shave much off this and still get the
  desired results.

  In detail, this pull request contains:

   - a set of five fixes for NVMe, mostly from Christoph and one from
     Roland.

   - a series from Bart, fixing issues with dm-mq and SCSI shared tags
     and scheduling. Note that one of those patches commit messages may
     read like an optimization, but it is in fact an important fix for
     queue restarts in particular.

   - a series from Omar, most importantly fixing a hang with multiple
     hardware queues when we fail to get a driver tag. Another important
     fix in there is for resizing hardware queues, which nbd does when
     handling multiple sockets for one connection.

   - fixing an imbalance in putting the ctx for hctx request allocations
     from Minchan"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: Restart a single queue if tag sets are shared
  dm rq: Avoid that request processing stalls sporadically
  scsi: Avoid that SCSI queues get stuck
  blk-mq: Introduce blk_mq_delay_run_hw_queue()
  blk-mq: remap queues when adding/removing hardware queues
  blk-mq-sched: fix crash in switch error path
  blk-mq-sched: set up scheduler tags when bringing up new queues
  blk-mq-sched: refactor scheduler initialization
  blk-mq: use the right hctx when getting a driver tag fails
  nvmet: fix byte swap in nvmet_parse_io_cmd
  nvmet: fix byte swap in nvmet_execute_write_zeroes
  nvmet: add missing byte swap in nvmet_get_smart_log
  nvme: add missing byte swap in nvme_setup_discard
  nvme: Correct NVMF enum values to match NVMe-oF rev 1.0
  block: do not put mq context in blk_mq_alloc_request_hctx
2017-04-08 11:56:58 -07:00
Christoph Hellwig
48920ff2a5 block: remove the discard_zeroes_data flag
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig
615ec946ab dm kcopyd: switch to use REQ_OP_WRITE_ZEROES
It seems like the code currently passes whatever it was using for writes
to WRITE SAME.  Just switch it to WRITE ZEROES, although that doesn't
need any payload.

Untested, and confused by the code, maybe someone who understands it
better than me can help..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig
ac62d6208a dm: support REQ_OP_WRITE_ZEROES
Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig
0f5d690f7b dm io: discards don't take a payload
Fix up do_region to not allocate a bio_vec for discards.  We've
got rid of the discard payload allocated by the caller years ago.

Obviously this wasn't actually harmful given how long it's been
there, but it's still good to avoid the pointless allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig
3deff1a70d md: support REQ_OP_WRITE_ZEROES
Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Jens Axboe
65f619d253 Merge branch 'for-linus' into for-4.12/block
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:45:20 -06:00
Bart Van Assche
6077c2d706 dm rq: Avoid that request processing stalls sporadically
While running the srp-test software I noticed that request
processing stalls sporadically at the beginning of a test, namely
when mkfs is run against a dm-mpath device. Every time when that
happened the following command was sufficient to resume request
processing:

    echo run >/sys/kernel/debug/block/dm-0/state

This patch avoids that such request processing stalls occur. The
test I ran is as follows:

    while srp-test/run_tests -d -r 30 -t 02-mq; do :; done

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:27:10 -06:00
Linus Torvalds
81d4bab4ce Merge tag 'dm-4.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:

 - two stable fixes for the verity target's FEC support

 - a stable fix for raid target's raid1 support (when no bitmap is used)

 - a 4.11 cache metadata v2 format fix to properly test blocks are clean

* tag 'dm-4.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm verity fec: fix bufio leaks
  dm raid: fix NULL pointer dereference for raid1 without bitmap
  dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty
  dm verity fec: limit error correction recursion
2017-04-07 10:47:20 -07:00
NeilBrown
fbbaf700e7 block: trace completion of all bios.
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete().  Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete.  Only bio_endio() knows that.

So move the trace_block_bio_complete() call to bio_endio().

Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.

There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
   trace event at the 'request' level, there is no point generating
   one at the bio level too.  In this case the bi_sector and bi_size
   will have changed, so the bio level event would be wrong

2/ If the bio hasn't actually been queued yet, but is being aborted
   early, then a trace event could be confusing.  Some filesystems
   call bio_endio() but do not want tracing.

3/ The bio_integrity code interposes itself by replacing bi_end_io,
   then restoring it and calling bio_endio() again.  This would produce
   two identical trace events if left like that.

To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.

When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication.  A new bio is split off the front, and
may be handle directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component.  To achieve this was can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.

So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 09:40:52 -06:00
Sami Tolvanen
86e3e83b44 dm verity fec: fix bufio leaks
Buffers read through dm_bufio_read() were not released in all code paths.

Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31 15:44:25 -04:00
Joe Thornber
cc7e394024 dm cache policy smq: make the cleaner policy write-back more aggressively
By ignoring the sentinels the cleaner policy is able to write-back dirty
cache data much faster.  There is no reason to respect the sentinels,
which denote that a block was changed recently, when using the cleaner
policy given that the cleaner is tasked with writing back all dirty
data.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31 11:41:05 -04:00
Joe Thornber
449b668ce0 dm cache: set/clear the cache core's dirty_bitset when loading mappings
When loading metadata make sure to set/clear the dirty bits in the cache
core's dirty_bitset as well as the policy.

Otherwise the cache core is unaware that any blocks were dirty when the
cache was last shutdown.  A very serious side-effect being that the
cleaner policy would therefore never be tasked with writing back dirty
data from a cache that was in writeback mode (e.g. when switching from
smq policy to cleaner policy when decommissioning a writeback cache).

This fixes a serious data corruption bug associated with writeback mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31 11:33:44 -04:00
Dmitry Bilunov
7a0c5c5b83 dm raid: fix NULL pointer dereference for raid1 without bitmap
Commit 4257e08 ("dm raid: support to change bitmap region size")
introduced a bitmap resize call during preresume phase. User can create
a DM device with "raid" target configured as raid1 with no metadata
devices to hold superblock/bitmap info. It can be achieved using the
following sequence:

  truncate -s 32M /dev/shm/raid-test
  LOOP=$(losetup --show -f /dev/shm/raid-test)
  dmsetup create raid-test-linear0 --table "0 1024 linear $LOOP 0"
  dmsetup create raid-test-linear1 --table "0 1024 linear $LOOP 1024"
  dmsetup create raid-test --table "0 1024 raid raid1 1 2048 2 - /dev/mapper/raid-test-linear0 - /dev/mapper/raid-test-linear1"

This results in the following crash:

[ 4029.110216] device-mapper: raid: Ignoring chunk size parameter for RAID 1
[ 4029.110217] device-mapper: raid: Choosing default region size of 4MiB
[ 4029.111349] md/raid1:mdX: active with 2 out of 2 mirrors
[ 4029.114770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 4029.114802] IP: bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.114816] PGD 0
…
[ 4029.115059] Hardware name: Aquarius Pro P30 S85 BUY-866/B85M-E, BIOS 2304 05/25/2015
[ 4029.115079] task: ffff88015cc29a80 task.stack: ffffc90001a5c000
[ 4029.115097] RIP: 0010:bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.115112] RSP: 0018:ffffc90001a5fb68 EFLAGS: 00010246
[ 4029.115127] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000
[ 4029.115146] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000000
[ 4029.115166] RBP: ffffc90001a5fc28 R08: 0000000800000000 R09: 00000008ffffffff
[ 4029.115185] R10: ffffea0005661600 R11: ffff88015cc29a80 R12: ffff88021231f058
[ 4029.115204] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 4029.115223] FS:  00007fe73a6b4740(0000) GS:ffff88021ea80000(0000) knlGS:0000000000000000
[ 4029.115245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4029.115261] CR2: 0000000000000030 CR3: 0000000159a74000 CR4: 00000000001426e0
[ 4029.115281] Call Trace:
[ 4029.115291]  ? raid_iterate_devices+0x63/0x80 [dm_raid]
[ 4029.115309]  ? dm_table_all_devices_attribute.isra.23+0x41/0x70 [dm_mod]
[ 4029.115329]  ? dm_table_set_restrictions+0x225/0x2d0 [dm_mod]
[ 4029.115346]  raid_preresume+0x81/0x2e0 [dm_raid]
[ 4029.115361]  dm_table_resume_targets+0x47/0xe0 [dm_mod]
[ 4029.115378]  dm_resume+0xa8/0xd0 [dm_mod]
[ 4029.115391]  dev_suspend+0x123/0x250 [dm_mod]
[ 4029.115405]  ? table_load+0x350/0x350 [dm_mod]
[ 4029.115419]  ctl_ioctl+0x1c2/0x490 [dm_mod]
[ 4029.115433]  dm_ctl_ioctl+0xe/0x20 [dm_mod]
[ 4029.115447]  do_vfs_ioctl+0x8d/0x5a0
[ 4029.115459]  ? ____fput+0x9/0x10
[ 4029.115470]  ? task_work_run+0x79/0xa0
[ 4029.115481]  SyS_ioctl+0x3c/0x70
[ 4029.115493]  entry_SYSCALL_64_fastpath+0x13/0x94

The raid_preresume() function incorrectly assumes that the raid_set has
a bitmap enabled if RT_FLAG_RS_BITMAP_LOADED is set.  But
RT_FLAG_RS_BITMAP_LOADED is getting set in __load_dirty_region_bitmap()
even if there is no bitmap present (and bitmap_load() happily returns 0
even if a bitmap isn't present).  So the only way forward in the
near-term is to check if the bitmap is present by seeing if
mddev->bitmap is not NULL after bitmap_load() has been called.

By doing so the above NULL pointer is avoided.

Fixes: 4257e08 ("dm raid: support to change bitmap region size")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru>
Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31 11:05:54 -04:00
Eric Biggers
f363b089be blk-mq: constify struct blk_mq_ops
Constify all instances of blk_mq_ops, as they are never modified.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-31 08:28:58 -06:00
Mikulas Patocka
7b81ef8b14 dm raid: select the Kconfig option CONFIG_MD_RAID0
Since the commit 0cf4503174 ("dm raid: add support for the MD RAID0
personality"), the dm-raid subsystem can activate a RAID-0 array.
Therefore, add MD_RAID0 to the dependencies of DM_RAID, so that MD_RAID0
will be selected when DM_RAID is selected.

Fixes: 0cf4503174 ("dm raid: add support for the MD RAID0 personality")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-30 11:17:08 -04:00
Ming Lei
8fc04e6ea0 md: raid1: kill warning on powerpc_pseries
This patch kills the warning reported on powerpc_pseries,
and actually we don't need the initialization.

	After merging the md tree, today's linux-next build (powerpc
	pseries_le_defconfig) produced this warning:

	drivers/md/raid1.c: In function 'raid1d':
	drivers/md/raid1.c:2172:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized]
	     if (memcmp(page_address(ppages[j]),
	         ^
	drivers/md/raid1.c:2160:7: note: 'page_len$' was declared here
	   int page_len[RESYNC_PAGES];
       ^

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-28 08:49:52 -07:00
SeongJae Park
4f6cce3910 Fix dead URLs to ftp.kernel.org
URLs to ftp.kernel.org are still exist though the service is closed [0].
This commit fixes the URLs to use www.kernel.org instead.

[0] https://www.kernel.org/shutting-down-ftp-services.html

Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-28 16:16:52 +02:00
Song Liu
0bb0c10500 md/raid5: use consistency_policy to remove journal feature
When journal device of an array fails, the array is forced into read-only
mode. To make the array normal without adding another journal device, we
need to remove journal _feature_ from the array.

This patch allows remove journal _feature_ from an array, For journal
existing journal should be either missing or faulty.

To remove journal feature, it is necessary to remove the journal device
first:

  mdadm --fail /dev/md0 /dev/sdb
  mdadm: set /dev/sdb faulty in /dev/md0
  mdadm --remove /dev/md0 /dev/sdb
  mdadm: hot removed /dev/sdb from /dev/md0

Then the journal feature can be removed by echoing into the sysfs file:

 cat /sys/block/md0/md/consistency_policy
 journal

 echo resync > /sys/block/md0/md/consistency_policy
 cat /sys/block/md0/md/consistency_policy
 resync

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-27 12:02:33 -07:00
Heinz Mauelshagen
6e53636fe8 dm raid: add raid4/5/6 journal write-back support via journal_mode option
Commit 63c32ed4af ("dm raid: add raid4/5/6 journaling support") added
journal support to close the raid4/5/6 "write hole" -- in terms of
writethrough caching.

Introduce a "journal_mode" feature and use the new
r5c_journal_mode_set() API to add support for switching the journal
device's cache mode between write-through (the current default) and
write-back.

NOTE: If the journal device is not layered on resilent storage and it
fails, write-through mode will cause the "write hole" to reoccur.  But
if the journal fails while in write-back mode it will cause data loss
for any dirty cache entries unless resilent storage is used for the
journal.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-27 12:08:07 -04:00
Heinz Mauelshagen
4464e36e06 dm raid: fix table line argument order in status
Commit 3a1c1ef2f ("dm raid: enhance status interface and fixup
takeover/raid0") added new table line arguments and introduced an
ordering flaw.  The sequence of the raid10_copies and raid10_format
raid parameters got reversed which causes lvm2 userspace to fail by
falsely assuming a changed table line.

Sequence those 2 parameters as before so that old lvm2 can function
properly with new kernels by adjusting the table line output as
documented in Documentation/device-mapper/dm-raid.txt.

Also, add missing version 1.10.1 highlight to the documention.

Fixes: 3a1c1ef2f ("dm raid: enhance status interface and fixup takeover/raid0")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-27 11:45:26 -04:00
Heinz Mauelshagen
78e470c26f md: add raid4/5/6 journal mode switching API
Commit 2ded370373 ("md/r5cache: State machine for raid5-cache write
back mode") added support for "write-back" caching on the raid journal
device.

In order to allow the dm-raid target to switch between the available
"write-through" and "write-back" modes, provide a new
r5c_journal_mode_set() API.

Use the new API in existing r5c_journal_mode_store()

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Shaohua Li <shli@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-27 11:13:47 -04:00
Jason Yan
1ad45a9bc4 md/raid5-cache: fix payload endianness problem in raid5-cache
The payload->header.type and payload->size are little-endian, so just
convert them to the right byte order.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Cc: <stable@vger.kernel.org> #v4.10+
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-25 09:38:22 -07:00
Shaohua Li
41743c1f04 md/raid1: skip data copy for behind io for discard request
discard request doesn't have data attached, so it's meaningless to
allocate memory and copy from original bio for behind IO. And the copy
is bogus because bio_copy_data_partial can't handle discard request.

We don't support writesame/writezeros request so far.

Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-25 09:38:06 -07:00
Mikulas Patocka
ff3af92b44 dm crypt: use shifts instead of sector_div
sector_div is very slow, so we introduce a variable sector_shift and
use shift instead of sector_div.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:24 -04:00
Mikulas Patocka
c2bcb2b702 dm integrity: add recovery mode
In recovery mode, we don't:
- replay the journal
- check checksums
- allow writes to the device

This mode can be used as a last resort for data recovery.  The
motivation for recovery mode is that when there is a single error in the
journal, the user should not lose access to the whole device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:23 -04:00
Mike Snitzer
1aa0efd421 dm integrity: factor out create_journal() from dm_integrity_ctr()
Preparation for next commit that makes call to create_journal()
optional.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:22 -04:00
Milan Broz
8f0009a225 dm crypt: optionally support larger encryption sector size
Add  optional "sector_size"  parameter that specifies encryption sector
size (atomic unit of block device encryption).

Parameter can be in range 512 - 4096 bytes and must be power of two.
For compatibility reasons, the maximal IO must fit into the page limit,
so the limit is set to the minimal page size possible (4096 bytes).

NOTE: this device cannot yet be handled by cryptsetup if this parameter
is set.

IV for the sector is calculated from the 512 bytes sector offset unless
the iv_large_sectors option is used.

Test script using dmsetup:

  DEV="/dev/sdb"
  DEV_SIZE=$(blockdev --getsz $DEV)
  KEY="9c1185a5c5e9fc54612808977ee8f548b2258d31ddadef707ba62c166051b9e3cd0294c27515f2bccee924e8823ca6e124b8fc3167ed478bca702babe4e130ac"
  BLOCK_SIZE=4096

  # dmsetup create test_crypt --table "0 $DEV_SIZE crypt aes-xts-plain64 $KEY 0 $DEV 0 1 sector_size:$BLOCK_SIZE"
  # dmsetup table --showkeys test_crypt

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:21 -04:00
Milan Broz
33d2f09fcb dm crypt: introduce new format of cipher with "capi:" prefix
For the new authenticated encryption we have to support generic composed
modes (combination of encryption algorithm and authenticator) because
this is how the kernel crypto API accesses such algorithms.

To simplify the interface, we accept an algorithm directly in crypto API
format.  The new format is recognised by the "capi:" prefix.  The
dmcrypt internal IV specification is the same as for the old format.

The crypto API cipher specifications format is:
     capi:cipher_api_spec-ivmode[:ivopts]
Examples:
     capi:cbc(aes)-essiv:sha256 (equivalent to old aes-cbc-essiv:sha256)
     capi:xts(aes)-plain64      (equivalent to old aes-xts-plain64)
Examples of authenticated modes:
     capi:gcm(aes)-random
     capi:authenc(hmac(sha256),xts(aes))-random
     capi:rfc7539(chacha20,poly1305)-random

Authenticated modes can only be configured using the new cipher format.
Note that this format allows user to specify arbitrary combinations that
can be insecure. (Policy decision is done in cryptsetup userspace.)

Authenticated encryption algorithms can be of two types, either native
modes (like GCM) that performs both encryption and authentication
internally, or composed modes where user can compose AEAD with separate
specification of encryption algorithm and authenticator.

For composed mode with HMAC (length-preserving encryption mode like an
XTS and HMAC as an authenticator) we have to calculate HMAC digest size
(the separate authentication key is the same size as the HMAC digest).
Introduce crypt_ctr_auth_cipher() to parse the crypto API string to get
HMAC algorithm and retrieve digest size from it.

Also, for HMAC composed mode we need to parse the crypto API string to
get the cipher mode nested in the specification.  For native AEAD mode
(like GCM), we can use crypto_tfm_alg_name() API to get the cipher
specification.

Because the HMAC composed mode is not processed the same as the native
AEAD mode, the CRYPT_MODE_INTEGRITY_HMAC flag is no longer needed and
"hmac" specification for the table integrity argument is removed.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:20 -04:00
Milan Broz
e889f97a3e dm crypt: factor IV constructor out to separate function
No functional change.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:54:19 -04:00
Milan Broz
ef43aa3806 dm crypt: add cryptographic data integrity protection (authenticated encryption)
Allow the use of per-sector metadata, provided by the dm-integrity
module, for integrity protection and persistently stored per-sector
Initialization Vector (IV).  The underlying device must support the
"DM-DIF-EXT-TAG" dm-integrity profile.

The per-bio integrity metadata is allocated by dm-crypt for every bio.

Example of low-level mapping table for various types of use:
 DEV=/dev/sdb
 SIZE=417792

 # Additional HMAC with CBC-ESSIV, key is concatenated encryption key + HMAC key
 SIZE_INT=389952
 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 32 J 0"
 dmsetup create y --table "0 $SIZE_INT crypt aes-cbc-essiv:sha256 \
 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
 00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff \
 0 /dev/mapper/x 0 1 integrity:32:hmac(sha256)"

 # AEAD (Authenticated Encryption with Additional Data) - GCM with random IVs
 # GCM in kernel uses 96bits IV and we store 128bits auth tag (so 28 bytes metadata space)
 SIZE_INT=393024
 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 28 J 0"
 dmsetup create y --table "0 $SIZE_INT crypt aes-gcm-random \
 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
 0 /dev/mapper/x 0 1 integrity:28:aead"

 # Random IV only for XTS mode (no integrity protection but provides atomic random sector change)
 SIZE_INT=401272
 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 16 J 0"
 dmsetup create y --table "0 $SIZE_INT crypt aes-xts-random \
 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
 0 /dev/mapper/x 0 1 integrity:16:none"

 # Random IV with XTS + HMAC integrity protection
 SIZE_INT=377656
 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 48 J 0"
 dmsetup create y --table "0 $SIZE_INT crypt aes-xts-random \
 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \
 00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff \
 0 /dev/mapper/x 0 1 integrity:48:hmac(sha256)"

Both AEAD and HMAC protection authenticates not only data but also
sector metadata.

HMAC protection is implemented through autenc wrapper (so it is
processed the same way as an authenticated mode).

In HMAC mode there are two keys (concatenated in dm-crypt mapping
table).  First is the encryption key and the second is the key for
authentication (HMAC).  (It is userspace decision if these keys are
independent or somehow derived.)

The sector request for AEAD/HMAC authenticated encryption looks like this:
 |----- AAD -------|------ DATA -------|-- AUTH TAG --|
 | (authenticated) | (auth+encryption) |              |
 | sector_LE |  IV |  sector in/out    |  tag in/out  |

For writes, the integrity fields are calculated during AEAD encryption
of every sector and stored in bio integrity fields and sent to
underlying dm-integrity target for storage.

For reads, the integrity metadata is verified during AEAD decryption of
every sector (they are filled in by dm-integrity, but the integrity
fields are pre-allocated in dm-crypt).

There is also an experimental support in cryptsetup utility for more
friendly configuration (part of LUKS2 format).

Because the integrity fields are not valid on initial creation, the
device must be "formatted".  This can be done by direct-io writes to the
device (e.g. dd in direct-io mode).  For now, there is available trivial
tool to do this, see: https://github.com/mbroz/dm_int_tools

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
Signed-off-by: Vashek Matyas <matyas@fi.muni.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:49:41 -04:00
Mikulas Patocka
7eada909bf dm: add integrity target
The dm-integrity target emulates a block device that has additional
per-sector tags that can be used for storing integrity information.

A general problem with storing integrity tags with every sector is that
writing the sector and the integrity tag must be atomic - i.e. in case of
crash, either both sector and integrity tag or none of them is written.

To guarantee write atomicity the dm-integrity target uses a journal. It
writes sector data and integrity tags into a journal, commits the journal
and then copies the data and integrity tags to their respective location.

The dm-integrity target can be used with the dm-crypt target - in this
situation the dm-crypt target creates the integrity data and passes them
to the dm-integrity target via bio_integrity_payload attached to the bio.
In this mode, the dm-crypt and dm-integrity targets provide authenticated
disk encryption - if the attacker modifies the encrypted device, an I/O
error is returned instead of random data.

The dm-integrity target can also be used as a standalone target, in this
mode it calculates and verifies the integrity tag internally. In this
mode, the dm-integrity target can be used to detect silent data
corruption on the disk or in the I/O path.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24 15:49:07 -04:00
Ming Lei
2d06e3b714 md: raid10: avoid direct access to bvec table in handle_reshape_read_error
All reshape I/O share pages from 1st copy device, so just use that pages
for avoiding direct access to bvec table in handle_reshape_read_error.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
cdb76be315 md: raid10: retrieve page from preallocated resync page array
Now one page array is allocated for each resync bio, and we can
retrieve page from this table directly.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
f025061836 md: raid10: don't use bio's vec table to manage resync pages
Now we allocate one page array for managing resync pages, instead
of using bio's vec table to do that, and the old way is very hacky
and won't work any more if multipage bvec is enabled.

The introduced cost is that we need to allocate (128 + 16) * copies
bytes per r10_bio, and it is fine because the inflight r10_bio for
resync shouldn't be much, as pointed by Shaohua.

Also bio_reset() in raid10_sync_request() and reshape_request()
are removed because all bios are freshly new now in these functions
and not necessary to reset any more.

This patch can be thought as cleanup too.

Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
81fa152008 md: raid10: refactor code of read reshape's .bi_end_io
reshape read request is a bit special and requires one extra
bio which isn't allocated from r10buf_pool.

Refactor the .bi_end_io for read reshape, so that we can use
raid10's resync page mangement approach easily in the following
patches.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
841c1316c7 md: raid1: improve write behind
This patch improve handling of write behind in the following ways:

- introduce behind master bio to hold all write behind pages
- fast clone bios from behind master bio
- avoid to change bvec table directly
- use bio_copy_data() and make code more clean

Suggested-by: Shaohua Li <shli@fb.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
d8c84c4f8b md: raid1: move 'offset' out of loop
The 'offset' local variable can't be changed inside the loop, so
move it out.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:37 -07:00
Ming Lei
60928a91b0 md: raid1: use bio helper in process_checks()
Avoid to direct access to bvec table.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:36 -07:00
Ming Lei
44cf0f4dc7 md: raid1: retrieve page from pre-allocated resync page array
Now one page array is allocated for each resync bio, and we can
retrieve page from this table directly.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:36 -07:00
Ming Lei
98d30c5812 md: raid1: don't use bio's vec table to manage resync pages
Now we allocate one page array for managing resync pages, instead
of using bio's vec table to do that, and the old way is very hacky
and won't work any more if multipage bvec is enabled.

The introduced cost is that we need to allocate (128 + 16) * raid_disks
bytes per r1_bio, and it is fine because the inflight r1_bio for
resync shouldn't be much, as pointed by Shaohua.

Also the bio_reset() in raid1_sync_request() is removed because
all bios are freshly new now and not necessary to reset any more.

This patch can be thought as a cleanup too

Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:36 -07:00
Ming Lei
a7234234d0 md: raid1: simplify r1buf_pool_free()
This patch gets each page's reference of each bio for resync,
then r1buf_pool_free() gets simplified a lot.

The same policy has been taken in raid10's buf pool allocation/free
too.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:36 -07:00
Ming Lei
513e2faa01 md: prepare for managing resync I/O pages in clean way
Now resync I/O use bio's bec table to manage pages,
this way is very hacky, and may not work any more
once multipage bvec is introduced.

So introduce helpers and new data structure for
managing resync I/O pages more cleanly.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:36 -07:00
Ming Lei
d8e29fbc3b md: move two macros into md.h
Both raid1 and raid10 share common resync
block size and page count, so move them into md.h.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:36 -07:00
Ming Lei
c85ba149de md: raid1/raid10: don't handle failure of bio_add_page()
All bio_add_page() is for adding one page into resync bio,
which is big enough to hold RESYNC_PAGES pages, and
the current bio_add_page() doesn't check queue limit any more,
so it won't fail at all.

remove unused label (shaohua)

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24 10:41:36 -07:00
Zhilong Liu
3560741e31 md: fix several trivial typos in comments
Signed-off-by: Zhilong Liu <zlliu@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-23 22:54:57 -07:00
Guoqing Jiang
27f26a0f37 md/raid10: refactor some codes from raid10_write_request
Previously, we clone both bio and repl_bio in raid10_write_request,
then add the cloned bio to plug->pending or conf->pending_bio_list
based on plug or not, and most of the logics are same for the two
conditions.

So introduce raid10_write_one_disk for it, and use replacement parameter
to distinguish the difference. No functional changes in the patch.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-23 22:42:14 -07:00
Dan Carpenter
0b408baf7f raid5-ppl: silence a misleading warning message
The "need_cache_flush" variable is never set to false.  When the
variable is true that means we print a warning message at the end of
the function.

Fixes: 3418d036c8 ("raid5-ppl: Partial Parity Log write logging implementation")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-23 22:38:46 -07:00
NeilBrown
4ad23a9764 MD: use per-cpu counter for writes_pending
The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean".  Consequently it needs to be updated frequently
but only checked for zero occasionally.  Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio.  This provided
justification for making the updates more efficient.

So we replace the atomic counter a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed.  As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.

We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero.  md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode.  If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.

It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.

mod_timer() already optimizes the case where the timeout value doesn't
actually change.  We leverage that further by always rounding off the
jiffies to the timeout value.  This may delay the marking of 'clean'
slightly, but ensure we only perform atomic operation here when absolutely
needed.

md_wakeup_thread() current always calls wake_up(), even if
THREAD_WAKEUP is already set.  That too can be optimised to avoid
calls to wake_up().

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-22 19:18:56 -07:00
NeilBrown
55cc39f345 md: close a race with setting mddev->in_sync
If ->in_sync is being set just as md_write_start() is being called,
it is possible that set_in_sync() won't see the elevated
->writes_pending, and md_write_start() won't see the set ->in_sync.

To close this race, re-test ->writes_pending after setting ->in_sync,
and add memory barriers to ensure the increment of ->writes_pending
will be seen by the time of this second test, or the new ->in_sync
will be seen by md_write_start().

Add a spinlock to array_state_show() to ensure this temporary
instability is never visible from userspace.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-22 19:18:30 -07:00