No need to dereference dentry twice to get the name when we already have
it stored in a local variable.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
This comment with a mysterious unfinished line confuses the kernel-doc
system since, because it starts with /**, it thinks it is documenting a
function.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Log error only when silent flag is not set.
Fixes: dbe6460388bc ("fs/befs/linuxvfs.c: check silent flag before logging errors")
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Acked-by: Salah Triki <salah.triki@gmail.com>
Merge updates from Andrew Morton:
- fsnotify updates
- ocfs2 updates
- all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
console: don't prefer first registered if DT specifies stdout-path
cred: simpler, 1D supplementary groups
CREDITS: update Pavel's information, add GPG key, remove snail mail address
mailmap: add Johan Hovold
.gitattributes: set git diff driver for C source code files
uprobes: remove function declarations from arch/{mips,s390}
spelling.txt: "modeled" is spelt correctly
nmi_backtrace: generate one-line reports for idle cpus
arch/tile: adopt the new nmi_backtrace framework
nmi_backtrace: do a local dump_stack() instead of a self-NMI
nmi_backtrace: add more trigger_*_cpu_backtrace() methods
min/max: remove sparse warnings when they're nested
Documentation/filesystems/proc.txt: add more description for maps/smaps
mm, proc: fix region lost in /proc/self/smaps
proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
proc: add LSM hook checks to /proc/<tid>/timerslack_ns
proc: relax /proc/<tid>/timerslack_ns capability requirements
meminfo: break apart a very long seq_printf with #ifdefs
seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
proc: faster /proc/*/status
...
Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.
If number of gids is <= 32, memory allocation is more or less tolerable
(140/148 bytes). But if it is not, code allocates full page (!)
regardless and, what's even more fun, doesn't reuse small 32-entry
array.
2D array means dependent shifts, loads and LEAs without possibility to
optimize them (gid is never known at compile time).
All of the above is unnecessary. Switch to the usual
trailing-zero-len-array scheme. Memory is allocated with
kmalloc/vmalloc() and only as much as needed. Accesses become simpler
(LEA 8(gi,idx,4) or even without displacement).
Maximum number of gids is 65536 which translates to 256KB+8 bytes. I
think kernel can handle such allocation.
On my usual desktop system with whole 9 (nine) aux groups, struct
group_info shrinks from 148 bytes to 44 bytes, yay!
Nice side effects:
- "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,
- fix little mess in net/ipv4/ping.c
should have been using GROUP_AT macro but this point becomes moot,
- aux group allocation is persistent and should be accounted as such.
Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently, Redhat reported that nvml test suite failed on QEMU/KVM,
more detailed info please refer to:
https://bugzilla.redhat.com/show_bug.cgi?id=1365721
Actually, this bug is not only for NVDIMM/DAX but also for any other
file systems. This simple test case abstracted from nvml can easily
reproduce this bug in common environment:
-------------------------- testcase.c -----------------------------
int
is_pmem_proc(const void *addr, size_t len)
{
const char *caddr = addr;
FILE *fp;
if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
printf("!/proc/self/smaps");
return 0;
}
int retval = 0; /* assume false until proven otherwise */
char line[PROCMAXLEN]; /* for fgets() */
char *lo = NULL; /* beginning of current range in smaps file */
char *hi = NULL; /* end of current range in smaps file */
int needmm = 0; /* looking for mm flag for current range */
while (fgets(line, PROCMAXLEN, fp) != NULL) {
static const char vmflags[] = "VmFlags:";
static const char mm[] = " wr";
/* check for range line */
if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
if (needmm) {
/* last range matched, but no mm flag found */
printf("never found mm flag.\n");
break;
} else if (caddr < lo) {
/* never found the range for caddr */
printf("#######no match for addr %p.\n", caddr);
break;
} else if (caddr < hi) {
/* start address is in this range */
size_t rangelen = (size_t)(hi - caddr);
/* remember that matching has started */
needmm = 1;
/* calculate remaining range to search for */
if (len > rangelen) {
len -= rangelen;
caddr += rangelen;
printf("matched %zu bytes in range "
"%p-%p, %zu left over.\n",
rangelen, lo, hi, len);
} else {
len = 0;
printf("matched all bytes in range "
"%p-%p.\n", lo, hi);
}
}
} else if (needmm && strncmp(line, vmflags,
sizeof(vmflags) - 1) == 0) {
if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
printf("mm flag found.\n");
if (len == 0) {
/* entire range matched */
retval = 1;
break;
}
needmm = 0; /* saw what was needed */
} else {
/* mm flag not set for some or all of range */
printf("range has no mm flag.\n");
break;
}
}
}
fclose(fp);
printf("returning %d.\n", retval);
return retval;
}
void *Addr;
size_t Size;
/*
* worker -- the work each thread performs
*/
static void *
worker(void *arg)
{
int *ret = (int *)arg;
*ret = is_pmem_proc(Addr, Size);
return NULL;
}
int main(int argc, char *argv[])
{
if (argc < 2 || argc > 3) {
printf("usage: %s file [env].\n", argv[0]);
return -1;
}
int fd = open(argv[1], O_RDWR);
struct stat stbuf;
fstat(fd, &stbuf);
Size = stbuf.st_size;
Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
close(fd);
pthread_t threads[NTHREAD];
int ret[NTHREAD];
/* kick off NTHREAD threads */
for (int i = 0; i < NTHREAD; i++)
pthread_create(&threads[i], NULL, worker, &ret[i]);
/* wait for all the threads to complete */
for (int i = 0; i < NTHREAD; i++)
pthread_join(threads[i], NULL);
/* verify that all the threads return the same value */
for (int i = 1; i < NTHREAD; i++) {
if (ret[0] != ret[i]) {
printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
ret[0], ret[i]);
}
}
printf("%d", ret[0]);
return 0;
}
It failed as some threads can not find the memory region in
"/proc/self/smaps" which is allocated in the main process
It is caused by proc fs which uses 'file->version' to indicate the VMA that
is the last one has already been handled by read() system call. When the
next read() issues, it uses the 'version' to find the VMA, then the next
VMA is what we want to handle, the related code is as follows:
if (last_addr) {
vma = find_vma(mm, last_addr);
if (vma && (vma = m_next_vma(priv, vma)))
return vma;
}
However, VMA will be lost if the last VMA is gone, e.g:
The process VMA list is A->B->C->D
CPU 0 CPU 1
read() system call
handle VMA B
version = B
return to userspace
unmap VMA B
issue read() again to continue to get
the region info
find_vma(version) will get VMA C
m_next_vma(C) will get VMA D
handle D
!!! VMA C is lost !!!
In order to fix this bug, we make 'file->version' indicate the end address
of the current VMA. m_start will then look up a vma which with vma_start
< last_vm_end and moves on to the next vma if we found the same or an
overlapping vma. This will guarantee that we will not miss an exclusive
vma but we can still miss one if the previous vma was shrunk. This is
acceptable because guaranteeing "never miss a vma" is simply not feasible.
User has to cope with some inconsistencies if the file is not read in one
go.
[mhocko@suse.com: changelog fixes]
Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Robert Hu <robert.hu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When an interface to allow a task to change another tasks timerslack was
first proposed, it was suggested that something greater then
CAP_SYS_NICE would be needed, as a task could be delayed further then
what normally could be done with nice adjustments.
So CAP_SYS_PTRACE was adopted instead for what became the
/proc/<tid>/timerslack_ns interface. However, for Android (where this
feature originates), giving the system_server CAP_SYS_PTRACE would allow
it to observe and modify all tasks memory. This is considered too high
a privilege level for only needing to change the timerslack.
After some discussion, it was realized that a CAP_SYS_NICE process can
set a task as SCHED_FIFO, so they could fork some spinning processes and
set them all SCHED_FIFO 99, in effect delaying all other tasks for an
infinite amount of time.
So as a CAP_SYS_NICE task can already cause trouble for other tasks,
using it as a required capability for accessing and modifying
/proc/<tid>/timerslack_ns seems sufficient.
Thus, this patch loosens the capability requirements to CAP_SYS_NICE and
removes CAP_SYS_PTRACE, simplifying some of the code flow as well.
This is technically an ABI change, but as the feature just landed in
4.6, I suspect no one is yet using it.
Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Nick Kralevich <nnk@google.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Colin Cross <ccross@android.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The global zero page is used to satisfy an anonymous read fault. If
THP(Transparent HugePage) is enabled then the global huge zero page is
used. The global huge zero page uses an atomic counter for reference
counting and is allocated/freed dynamically according to its counter
value.
CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults. This patch proposes a way to
reduce the access to the global counter so that the CPU load can be
reduced accordingly.
To do this, a new flag of the mm_struct is introduced:
MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only need to touch
the global counter in two cases:
1 The first time it uses the global huge zero page;
2 The time when mm_user of its mm_struct reaches zero.
Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away. With this patch, the page will not be
eligible to be freed until the exit of the last process from which it
was ever used.
And with the use of mm_user, the kthread is not eligible to use huge
zero page either. Since no kthread is using huge zero page today, there
is no difference after applying this patch. But if that is not desired,
I can change it to when mm_count reaches zero.
Case used for test on Haswell EP:
usemem -n 72 --readonly -j 0x200000 100G
Which spawns 72 processes and each will mmap 100G anonymous space and
then do read only access to that space sequentially with a step of 2MB.
CPU cycles from perf report for base commit:
54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
CPU cycles from perf report for this commit:
0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page
Performance(throughput) of the workload for base commit: 1784430792
Performance(throughput) of the workload for this commit: 4726928591
164% increase.
Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
50% drop.
Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trying to walk all of virtual memory requires architecture specific
knowledge. On x86_64, addresses must be sign extended from bit 48,
whereas on arm64 the top VA_BITS of address space have their own set of
page tables.
clear_refs_write() calls walk_page_range() on the range 0 to ~0UL, it
provides a test_walk() callback that only expects to be walking over
VMAs. Currently walk_pmd_range() will skip memory regions that don't
have a VMA, reporting them as a hole.
As this call only expects to walk user address space, make it walk 0 to
'highest_vm_end'.
Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fanotify code has its own lock (access_lock) to protect a list of events
waiting for a response from userspace.
However this is somewhat awkward as the same list_head in the event is
protected by notification_lock if it is part of the notification queue
and by access_lock if it is part of the fanotify private queue which
makes it difficult for any reliable checks in the generic code. So make
fanotify use the same lock - notification_lock - for protecting its
private event list.
Link: http://lkml.kernel.org/r/1473797711-14111-6-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fsnotify_flush_notify() and fanotify_release() destroy notification
event while holding notification_mutex.
The destruction of fanotify event includes a path_put() call which may
end up calling into a filesystem to delete an inode if we happen to be
the last holders of dentry reference which happens to be the last holder
of inode reference.
That in turn may violate lock ordering for some filesystems since
notification_mutex is also acquired e. g. during write when generating
fanotify event.
Also this is the only thing that forces notification_mutex to be a
sleeping lock. So drop notification_mutex before destroying a
notification event.
Link: http://lkml.kernel.org/r/1473797711-14111-4-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All filesystems that support xattrs by now do so via xattr handlers.
They all define sb->s_xattr, and their getxattr, setxattr, and
removexattr inode operations use the generic inode operations. On
filesystems that don't support xattrs, the xattr inode operations are
all NULL, and sb->s_xattr is also NULL.
This means that we can remove the getxattr, setxattr, and removexattr
inode operations and directly call the generic handlers, or better,
inline expand those handlers into fs/xattr.c.
Filesystems that do not support xattrs on some inodes should clear the
IOP_XATTR i_opflags flag in those inodes. (Right now, some filesystems
have checks to disable xattrs on some inodes in the ->list, ->get, and
->set xattr handler operations instead.) The IOP_XATTR flag is
automatically cleared in inodes of filesystems that don't have xattr
support.
In orangefs, symlinks do have a setxattr iop but no getxattr iop. Add a
check for symlinks to orangefs_inode_getxattr to preserve the current,
weird behavior; that check may not be necessary though.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When an inode doesn't support xattrs, turn listxattr off as well.
(When xattrs are "turned off", the VFS still passes security xattr
operations through to security modules, which can still expose inode
security labels that way.)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Right now, various places in the kernel check for the existence of
getxattr, setxattr, and removexattr inode operations and directly call
those operations. Switch to helper functions and test for the IOP_XATTR
flag instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of special xattr inode operations, use the IOP_XATTR inode
operations flag for the special libfs empty directories.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With this change, all the xattr handler based operations will produce an
-EIO result for bad inodes, and we no longer only depend on inode->i_op
to be set to bad_inode_ops.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The IOP_XATTR inode operations flag in inode->i_opflags indicates that
the inode has xattr support. The flag is automatically set by
new_inode() on filesystems with xattr support (where sb->s_xattr is
defined), and cleared otherwise. Filesystems can explicitly clear it
for inodes that should not have xattr support.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull VFS splice updates from Al Viro:
"There's a bunch of branches this cycle, both mine and from other folks
and I'd rather send pull requests separately.
This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
(and introduction of such). Gets rid of a lot of code in fs/splice.c
and elsewhere; there will be followups, but these are for the next
cycle... Some pipe/splice-related cleanups from Miklos in the same
branch as well"
* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
pipe: fix comment in pipe_buf_operations
pipe: add pipe_buf_steal() helper
pipe: add pipe_buf_confirm() helper
pipe: add pipe_buf_release() helper
pipe: add pipe_buf_get() helper
relay: simplify relay_file_read()
switch default_file_splice_read() to use of pipe-backed iov_iter
switch generic_file_splice_read() to use of ->read_iter()
new iov_iter flavour: pipe-backed
fuse_dev_splice_read(): switch to add_to_pipe()
skb_splice_bits(): get rid of callback
new helper: add_to_pipe()
splice: lift pipe_lock out of splice_to_pipe()
splice: switch get_iovec_page_array() to iov_iter
splice_to_pipe(): don't open-code wakeup_pipe_readers()
consistent treatment of EFAULT on O_DIRECT read/write
Pull ext4 updates from Ted Ts'o:
"Lots of bug fixes and cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
ext4: remove unused variable
ext4: use journal inode to determine journal overhead
ext4: create function to read journal inode
ext4: unmap metadata when zeroing blocks
ext4: remove plugging from ext4_file_write_iter()
ext4: allow unlocked direct IO when pages are cached
ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY
fscrypto: use standard macros to compute length of fname ciphertext
ext4: do not unnecessarily null-terminate encrypted symlink data
ext4: release bh in make_indexed_dir
ext4: Allow parallel DIO reads
ext4: allow DAX writeback for hole punch
jbd2: fix lockdep annotation in add_transaction_credits()
blockgroup_lock.h: simplify definition of NR_BG_LOCKS
blockgroup_lock.h: remove debris from bgl_lock_ptr() conversion
fscrypto: make filename crypto functions return 0 on success
fscrypto: rename completion callbacks to reflect usage
fscrypto: remove unnecessary includes
fscrypto: improved validation when loading inode encryption metadata
ext4: fix memory leak when symlink decryption fails
...
Pull block layer updates from Jens Axboe:
"This is the main pull request for block layer changes in 4.9.
As mentioned at the last merge window, I've changed things up and now
do just one branch for core block layer changes, and driver changes.
This avoids dependencies between the two branches. Outside of this
main pull request, there are two topical branches coming as well.
This pull request contains:
- A set of fixes, and a conversion to blk-mq, of nbd. From Josef.
- Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
Followup dependency fix from Geert.
- General fixes from Bart, Baoyou, Guoqing, and Linus W.
- CFQ async write starvation fix from Glauber.
- Add supprot for delayed kick of the requeue list, from Mike.
- Pull out the scalable bitmap code from blk-mq-tag.c and make it
generally available under the name of sbitmap. Only blk-mq-tag uses
it for now, but the blk-mq scheduling bits will use it as well.
From Omar.
- bdev thaw error progagation from Pierre.
- Improve the blk polling statistics, and allow the user to clear
them. From Stephen.
- Set of minor cleanups from Christoph in block/blk-mq.
- Set of cleanups and optimizations from me for block/blk-mq.
- Various nvme/nvmet/nvmeof fixes from the various folks"
* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
fs/block_dev.c: return the right error in thaw_bdev()
nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
nvme/scsi: Remove power management support
nvmet: Make dsm number of ranges zero based
nvmet: Use direct IO for writes
admin-cmd: Added smart-log command support.
nvme-fabrics: Add host_traddr options field to host infrastructure
nvme-fabrics: revise host transport option descriptions
nvme-fabrics: rework nvmf_get_address() for variable options
nbd: use BLK_MQ_F_BLOCKING
blkcg: Annotate blkg_hint correctly
cfq: fix starvation of asynchronous writes
blk-mq: add flag for drivers wanting blocking ->queue_rq()
blk-mq: remove non-blocking pass in blk_mq_map_request
blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
block: export bio_free_pages to other modules
lightnvm: propagate device_add() error code
lightnvm: expose device geometry through sysfs
lightnvm: control life of nvm_dev in driver
blk-mq: register device instead of disk
...