Merge tag 'docs-5.1' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "A fairly routine cycle for docs - lots of typo fixes, some new
  documents, and more translations. There's also some LICENSES
  adjustments from Thomas"

* tag 'docs-5.1' of git://git.lwn.net/linux: (74 commits)
  docs: Bring some order to filesystem documentation
  Documentation/locking/lockdep: Drop last two chars of sample states
  doc: rcu: Suspicious RCU usage is a warning
  docs: driver-api: iio: fix errors in documentation
  Documentation/process/howto: Update for 4.x -> 5.x versioning
  docs: Explicitly state that the 'Fixes:' tag shouldn't split lines
  doc: security: Add kern-doc for lsm_hooks.h
  doc: sctp: Merge and clean up rst files
  Docs: Correct /proc/stat path
  scripts/spdxcheck.py: fix C++ comment style detection
  doc: fix typos in license-rules.rst
  Documentation: fix admin-guide/README.rst minimum gcc version requirement
  doc: process: complete removal of info about -git patches
  doc: translations: sync translations 'remove info about -git patches'
  perf-security: wrap paragraphs on 72 columns
  perf-security: elaborate on perf_events/Perf privileged users
  perf-security: document collected perf_events/Perf data categories
  perf-security: document perf_events/Perf resource control
  sysfs.txt: add note on available attribute macros
  docs: kernel-doc: typo "if ... if" -> "if ... is"
  ...
Linus Torvalds
2019-03-09 09:56:17 -08:00
83 changed files with 3741 additions and 1134 deletions

@@ -0,0 +1,150 @@
=============================
Linux Filesystems API summary
=============================
This section contains API-level documentation, mostly taken from the source
code itself.
The Linux VFS
=============
The Filesystem types
--------------------
.. kernel-doc:: include/linux/fs.h
:internal:
The Directory Cache
-------------------
.. kernel-doc:: fs/dcache.c
:export:
.. kernel-doc:: include/linux/dcache.h
:internal:
Inode Handling
--------------
.. kernel-doc:: fs/inode.c
:export:
.. kernel-doc:: fs/bad_inode.c
:export:
Registration and Superblocks
----------------------------
.. kernel-doc:: fs/super.c
:export:
File Locks
----------
.. kernel-doc:: fs/locks.c
:export:
.. kernel-doc:: fs/locks.c
:internal:
Other Functions
---------------
.. kernel-doc:: fs/mpage.c
:export:
.. kernel-doc:: fs/namei.c
:export:
.. kernel-doc:: fs/buffer.c
:export:
.. kernel-doc:: block/bio.c
:export:
.. kernel-doc:: fs/seq_file.c
:export:
.. kernel-doc:: fs/filesystems.c
:export:
.. kernel-doc:: fs/fs-writeback.c
:export:
.. kernel-doc:: fs/block_dev.c
:export:
.. kernel-doc:: fs/anon_inodes.c
:export:
.. kernel-doc:: fs/attr.c
:export:
.. kernel-doc:: fs/d_path.c
:export:
.. kernel-doc:: fs/dax.c
:export:
.. kernel-doc:: fs/direct-io.c
:export:
.. kernel-doc:: fs/file_table.c
:export:
.. kernel-doc:: fs/libfs.c
:export:
.. kernel-doc:: fs/posix_acl.c
:export:
.. kernel-doc:: fs/stat.c
:export:
.. kernel-doc:: fs/sync.c
:export:
.. kernel-doc:: fs/xattr.c
:export:
The proc filesystem
===================
sysctl interface
----------------
.. kernel-doc:: kernel/sysctl.c
:export:
proc filesystem interface
-------------------------
.. kernel-doc:: fs/proc/base.c
:internal:
Events based on file descriptors
================================
.. kernel-doc:: fs/eventfd.c
:export:
The Filesystem for Exporting Kernel Objects
===========================================
.. kernel-doc:: fs/sysfs/file.c
:export:
.. kernel-doc:: fs/sysfs/symlink.c
:export:
The debugfs filesystem
======================
debugfs interface
-----------------
.. kernel-doc:: fs/debugfs/inode.c
:export:
.. kernel-doc:: fs/debugfs/file.c
:export:

@@ -0,0 +1,68 @@
.. SPDX-License-Identifier: GPL-2.0
The Android binderfs Filesystem
===============================
Android binderfs is a filesystem for the Android binder IPC mechanism. It
allows binder devices to be added and removed dynamically at runtime. Binder devices
located in a new binderfs instance are independent of binder devices located in
other binderfs instances. Mounting a new binderfs instance makes it possible
to get a set of private binder devices.
Mounting binderfs
-----------------
Android binderfs can be mounted with::

  mkdir /dev/binderfs
  mount -t binder binder /dev/binderfs

at which point a new instance of binderfs will show up at ``/dev/binderfs``.
In a fresh instance of binderfs no binder devices will be present. There will
only be a ``binder-control`` device which serves as the request handler for
binderfs. Mounting another binderfs instance at a different location will
create a new and separate instance from all other binderfs mounts. This is
identical to the behavior of e.g. ``devpts`` and ``tmpfs``. The Android
binderfs filesystem can be mounted in user namespaces.
Options
-------
max
  binderfs instances can be mounted with a limit on the number of binder
  devices that can be allocated. The ``max=<count>`` mount option serves as
  a per-instance limit. If ``max=<count>`` is set, then at most ``<count>``
  binder devices can be allocated in this binderfs instance.
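
The same mount can be requested from a program through ``mount(2)``. The following is a minimal sketch, not taken from the binderfs sources: the helper names ``binderfs_format_max()`` and ``binderfs_mount_limited()`` are hypothetical, and the code assumes that the ``max=<count>`` option is passed as the filesystem-specific data string, as it would be with ``mount -o max=<count>``.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: format the "max=<count>" mount option into buf.
 * Returns 0 on success, -1 if buf is too small to hold the option.
 */
int binderfs_format_max(char *buf, size_t len, unsigned int count)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "max=%u", count);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: mount a binderfs instance at target, limited to
 * count binder devices. Intended to be equivalent to
 *     mount -t binder -o max=<count> binder <target>
 * Returns 0 on success, -1 with errno set on failure.
 */
int binderfs_mount_limited(const char *target, unsigned int count)
{
	char opts[32];

	if (binderfs_format_max(opts, sizeof(opts), count) < 0) {
		errno = EINVAL;
		return -1;
	}
	return mount("binder", target, "binder", 0, opts);
}
```

A caller would typically create the mount point first and then call, for example, ``binderfs_mount_limited("/dev/binderfs", 16)``. The call needs sufficient privileges in the namespace it runs in.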
Allocating binder Devices
-------------------------
.. _ioctl: http://man7.org/linux/man-pages/man2/ioctl.2.html
To allocate a new binder device in a binderfs instance a request needs to be
sent through the ``binder-control`` device node. A request is sent in the form
of an `ioctl() <ioctl_>`_.
A program needs to open the ``binder-control`` device node and
send a ``BINDER_CTL_ADD`` request to the kernel. Users of binderfs need to
tell the kernel which name the new binder device should get. By default a name
can only contain up to ``BINDERFS_MAX_NAME`` characters including the terminating
zero byte.
Once the request is made via an `ioctl() <ioctl_>`_ passing a
``struct binder_device`` with the name to the kernel, it will allocate a new
binder device and return the major and minor number of the new device in the
struct. (This is necessary because binderfs allocates a major device number
dynamically.) After the `ioctl() <ioctl_>`_ returns, there will be a new
binder device located under ``/dev/binderfs`` with the chosen name.
Deleting binder Devices
-----------------------
.. _unlink: http://man7.org/linux/man-pages/man2/unlink.2.html
.. _rm: http://man7.org/linux/man-pages/man1/rm.1.html
Binderfs binder devices can be deleted via `unlink() <unlink_>`_. This means
that the `rm <rm_>`_ tool can be used to delete them. Note that the
``binder-control`` device cannot be deleted since this would make the binderfs
instance unusable. The ``binder-control`` device will be deleted when the
binderfs instance is unmounted and all references to it have been dropped.
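
Deleting a device from a program is a plain ``unlink()`` on its path. The sketch below is illustrative only: ``binderfs_remove_device()`` is a hypothetical helper, and rejecting ``binder-control`` in userspace with ``EPERM`` is a choice made for this example rather than a statement about which error the kernel reports.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: delete the binder device called name from the
 * binderfs instance mounted at mountpoint.
 * Returns 0 on success, -1 with errno set on failure.
 */
int binderfs_remove_device(const char *mountpoint, const char *name)
{
	char path[4096];
	int n;

	assert(mountpoint && name);

	/* binder-control cannot be deleted, so fail early (example policy). */
	if (strcmp(name, "binder-control") == 0) {
		errno = EPERM;
		return -1;
	}

	n = snprintf(path, sizeof(path), "%s/%s", mountpoint, name);
	if (n < 0 || (size_t)n >= sizeof(path)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	return unlink(path);
}
```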

@@ -1,382 +1,43 @@
=====================
Linux Filesystems API
=====================
===============================
Filesystems in the Linux kernel
===============================
The Linux VFS
=============
This under-development manual will, some glorious day, provide
comprehensive information on how the Linux virtual filesystem (VFS) layer
works, along with the filesystems that sit below it. For now, what we have
can be found below.
The Filesystem types
--------------------
.. kernel-doc:: include/linux/fs.h
:internal:
The Directory Cache
-------------------
.. kernel-doc:: fs/dcache.c
:export:
.. kernel-doc:: include/linux/dcache.h
:internal:
Inode Handling
--------------
.. kernel-doc:: fs/inode.c
:export:
.. kernel-doc:: fs/bad_inode.c
:export:
Registration and Superblocks
----------------------------
.. kernel-doc:: fs/super.c
:export:
File Locks
----------
.. kernel-doc:: fs/locks.c
:export:
.. kernel-doc:: fs/locks.c
:internal:
Other Functions
---------------
.. kernel-doc:: fs/mpage.c
:export:
.. kernel-doc:: fs/namei.c
:export:
.. kernel-doc:: fs/buffer.c
:export:
.. kernel-doc:: block/bio.c
:export:
.. kernel-doc:: fs/seq_file.c
:export:
.. kernel-doc:: fs/filesystems.c
:export:
.. kernel-doc:: fs/fs-writeback.c
:export:
.. kernel-doc:: fs/block_dev.c
:export:
.. kernel-doc:: fs/anon_inodes.c
:export:
.. kernel-doc:: fs/attr.c
:export:
.. kernel-doc:: fs/d_path.c
:export:
.. kernel-doc:: fs/dax.c
:export:
.. kernel-doc:: fs/direct-io.c
:export:
.. kernel-doc:: fs/file_table.c
:export:
.. kernel-doc:: fs/libfs.c
:export:
.. kernel-doc:: fs/posix_acl.c
:export:
.. kernel-doc:: fs/stat.c
:export:
.. kernel-doc:: fs/sync.c
:export:
.. kernel-doc:: fs/xattr.c
:export:
The proc filesystem
===================
sysctl interface
----------------
.. kernel-doc:: kernel/sysctl.c
:export:
proc filesystem interface
-------------------------
.. kernel-doc:: fs/proc/base.c
:internal:
Events based on file descriptors
================================
.. kernel-doc:: fs/eventfd.c
:export:
The Filesystem for Exporting Kernel Objects
===========================================
.. kernel-doc:: fs/sysfs/file.c
:export:
.. kernel-doc:: fs/sysfs/symlink.c
:export:
The debugfs filesystem
Core VFS documentation
======================
debugfs interface
-----------------
.. kernel-doc:: fs/debugfs/inode.c
:export:
.. kernel-doc:: fs/debugfs/file.c
:export:
The Linux Journalling API
=========================
Overview
--------
Details
~~~~~~~
The journalling layer is easy to use. You need to first of all create a
journal_t data structure. There are two calls to do this dependent on
how you decide to allocate the physical media on which the journal
resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in
filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used
for journals stored on a raw device (in a contiguous range of blocks). A
journal_t is a typedef for a struct pointer, so when you are finally
finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up
any used kernel memory.
Once you have got your journal_t object you need to 'mount' or load the
journal file. The journalling layer expects the space for the journal
was already allocated and initialized properly by the userspace tools.
When loading the journal you must call :c:func:`jbd2_journal_load` to process
journal contents. If the client file system detects that the journal contents
do not need to be processed (or need not even have valid contents), it
may call :c:func:`jbd2_journal_wipe` to clear the journal contents before
calling :c:func:`jbd2_journal_load`.
Note that jbd2_journal_wipe(..,0) calls
:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding
transactions in the journal and similarly :c:func:`jbd2_journal_load` will
call :c:func:`jbd2_journal_recover` if necessary. I would advise reading
:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage.
Now you can go ahead and start modifying the underlying filesystem.
Almost.
You still need to actually journal your filesystem changes; this is done
by wrapping them into transactions. Additionally you also need to wrap
the modification of each of the buffers with calls to the journal layer,
so it knows what the modifications you are actually making are. To do
this use :c:func:`jbd2_journal_start` which returns a transaction handle.
:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`,
which indicates the end of a transaction, are nestable calls, so you can
reenter a transaction if necessary, but remember you must call
:c:func:`jbd2_journal_stop` the same number of times as
:c:func:`jbd2_journal_start` before the transaction is completed (or more
accurately leaves the update phase). Ext4/VFS makes use of this feature to
simplify handling of inode dirtying, quota support, etc.
Inside each transaction you need to wrap the modifications to the
individual buffers (blocks). Before you start to modify a buffer you
need to call :c:func:`jbd2_journal_get_create_access()` /
:c:func:`jbd2_journal_get_write_access()` /
:c:func:`jbd2_journal_get_undo_access()` as appropriate; this allows the
journalling layer to copy the unmodified data if it needs to. After all,
the buffer may be part of a previously uncommitted transaction. At this
point you are at last ready to modify a buffer, and once you have done so
you need to call :c:func:`jbd2_journal_dirty_metadata`. Or, if you've asked
for access to a buffer that you now know is no longer required to be pushed
back to the device, you can call :c:func:`jbd2_journal_forget` in much the
same way as you might have used :c:func:`bforget` in the past.
A :c:func:`jbd2_journal_flush` may be called at any time to commit and
checkpoint all your transactions.
Then at umount time, in your :c:func:`put_super`, you can call
:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.
Unfortunately there are a couple of ways the journal layer can cause a
deadlock. The first thing to note is that each task can only have a
single outstanding transaction at any one time; remember that nothing
commits until the outermost :c:func:`jbd2_journal_stop`. This means you must
complete the transaction at the end of each file/inode/address etc.
operation you perform, so that the journalling system isn't re-entered on
another journal. This matters because transactions can't be nested/batched
across differing journals, and a filesystem other than yours (say ext4)
may be modified in a later syscall.
The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
if there isn't enough space in the journal for your transaction (based
on the passed nblocks param) - when it blocks it merely(!) needs to wait
for transactions to complete and be committed from other tasks, so
essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid
deadlocks you must treat :c:func:`jbd2_journal_start` /
:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
your semaphore ordering rules to prevent
deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking
behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as
easily as on :c:func:`jbd2_journal_start`.
Try to reserve the right number of blocks the first time ;-). This will
be the maximum number of blocks you are going to touch in this
transaction. I advise having a look at ext4_jbd.h, at least, to see the
basis on which ext4 makes these decisions.
Another wriggle to watch out for is your on-disk block allocation
strategy. Why? Because, if you do a delete, you need to ensure you
haven't reused any of the freed blocks until the transaction freeing
these blocks commits. If you reuse these blocks and a crash happens,
there is no way to restore the contents of the reallocated blocks at the
end of the last fully committed transaction. One simple way of ensuring
this is to mark blocks as free in internal in-memory block allocation
structures only after the transaction freeing them commits. Ext4 uses
a journal commit callback for this purpose.
With journal commit callbacks you can ask the journalling layer to call
a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer to call the callback by simply setting the
``journal->j_commit_callback`` function pointer; that function is then
called after each transaction commit. You can also use
``transaction->t_private_list`` for attaching entries to a transaction
that need processing when the transaction commits.
JBD2 also provides a way to block all transaction updates via
:c:func:`jbd2_journal_lock_updates()` /
:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
window with a clean and stable fs for a moment. E.g.

::

    jbd2_journal_lock_updates()   // stop new stuff happening..
    jbd2_journal_flush()          // checkpoint everything.
    ..do stuff on stable fs
    jbd2_journal_unlock_updates() // carry on with filesystem use.

The opportunities for abuse and DoS attacks with this should be obvious,
if you allow unprivileged userspace to trigger codepaths containing
these calls.
Summary
~~~~~~~
Using the journal is a matter of wrapping the different context changes,
being each mount, each modification (transaction) and each changed
buffer to tell the journalling layer about them.
Data Types
----------
The journalling layer uses typedefs to 'hide' the concrete definitions
of the structures used. As a client of the JBD2 layer you can just rely
on using the pointer as a magic cookie of some sort. Obviously the
hiding is not enforced as this is 'C'.
Structures
~~~~~~~~~~
.. kernel-doc:: include/linux/jbd2.h
:internal:
Functions
---------
The functions here are split into two groups: those that affect a journal
as a whole, and those which are used to manage transactions.
Journal Level
~~~~~~~~~~~~~
.. kernel-doc:: fs/jbd2/journal.c
:export:
.. kernel-doc:: fs/jbd2/recovery.c
:internal:
Transaction Level
~~~~~~~~~~~~~~~~~
.. kernel-doc:: fs/jbd2/transaction.c
See also
--------
`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__
`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__
splice API
==========
splice is a method for moving blocks of data around inside the kernel,
without continually transferring them between the kernel and user space.
.. kernel-doc:: fs/splice.c
pipes API
=========
Pipe interfaces are all for in-kernel (builtin image) use. They are not
exported for use by modules.
.. kernel-doc:: include/linux/pipe_fs_i.h
:internal:
.. kernel-doc:: fs/pipe.c
Encryption API
==============
A library which filesystems can hook into to support transparent
encryption of files and directories.
.. toctree::
:maxdepth: 2
fscrypt
Pathname lookup
===============
This write-up is based on three articles published at lwn.net:
- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
- <https://lwn.net/Articles/650786/> A walk among the symlinks
Written by Neil Brown with help from Al Viro and Jon Corbet.
It has subsequently been updated to reflect changes in the kernel
including:
- per-directory parallel name lookup.
See these manuals for documentation about the VFS layer itself and how its
algorithms work.
.. toctree::
:maxdepth: 2
path-lookup.rst
api-summary
splice
Filesystem support layers
=========================
Documentation for the support code within the filesystem layer for use in
filesystem implementations.
.. toctree::
:maxdepth: 2
journalling
fscrypt
Filesystem-specific documentation
=================================
Documentation for individual filesystem types can be found here.
.. toctree::
:maxdepth: 2
binderfs.rst

@@ -0,0 +1,184 @@
The Linux Journalling API
=========================
Overview
--------
Details
~~~~~~~
The journalling layer is easy to use. You need to first of all create a
journal_t data structure. There are two calls to do this dependent on
how you decide to allocate the physical media on which the journal
resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in
filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used
for journals stored on a raw device (in a contiguous range of blocks). A
journal_t is a typedef for a struct pointer, so when you are finally
finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up
any used kernel memory.
Once you have got your journal_t object you need to 'mount' or load the
journal file. The journalling layer expects the space for the journal
was already allocated and initialized properly by the userspace tools.
When loading the journal you must call :c:func:`jbd2_journal_load` to process
journal contents. If the client file system detects that the journal contents
do not need to be processed (or need not even have valid contents), it
may call :c:func:`jbd2_journal_wipe` to clear the journal contents before
calling :c:func:`jbd2_journal_load`.
Note that jbd2_journal_wipe(..,0) calls
:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding
transactions in the journal and similarly :c:func:`jbd2_journal_load` will
call :c:func:`jbd2_journal_recover` if necessary. I would advise reading
:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage.
Now you can go ahead and start modifying the underlying filesystem.
Almost.
You still need to actually journal your filesystem changes; this is done
by wrapping them into transactions. Additionally you also need to wrap
the modification of each of the buffers with calls to the journal layer,
so it knows what the modifications you are actually making are. To do
this use :c:func:`jbd2_journal_start` which returns a transaction handle.
:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`,
which indicates the end of a transaction, are nestable calls, so you can
reenter a transaction if necessary, but remember you must call
:c:func:`jbd2_journal_stop` the same number of times as
:c:func:`jbd2_journal_start` before the transaction is completed (or more
accurately leaves the update phase). Ext4/VFS makes use of this feature to
simplify handling of inode dirtying, quota support, etc.
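
The nesting rule can be illustrated with a small userspace model. This is a toy sketch, not jbd2 code: the ``toy_`` names are hypothetical, and the model only assumes what the paragraph above states, namely that a nested start re-enters the running transaction and that the update phase ends only when stop has been called as many times as start.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a per-task transaction handle; not the jbd2 handle. */
struct toy_handle {
	int refcount;	/* start calls not yet matched by a stop */
	int completed;	/* times the update phase has been left */
};

/* Toy stand-in for the task's "currently running handle". */
static struct toy_handle *toy_current;

/*
 * Start a transaction, or re-enter the one already running.
 * storage is only used when no transaction is running yet.
 */
struct toy_handle *toy_journal_start(struct toy_handle *storage)
{
	if (toy_current) {
		/* Nested call: reuse the outer handle and count the entry. */
		toy_current->refcount++;
		return toy_current;
	}
	assert(storage != NULL);
	storage->refcount = 1;
	toy_current = storage;
	return storage;
}

/*
 * Stop one level of the transaction.
 * Returns 1 if this was the outermost stop, 0 otherwise.
 */
int toy_journal_stop(struct toy_handle *handle)
{
	assert(handle && handle == toy_current && handle->refcount > 0);
	if (--handle->refcount > 0)
		return 0;
	handle->completed++;
	toy_current = NULL;
	return 1;
}
```

In this model an inner ``toy_journal_start()`` returns the same handle as the outer one, and only the last ``toy_journal_stop()`` reports that the transaction has left its update phase.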
Inside each transaction you need to wrap the modifications to the
individual buffers (blocks). Before you start to modify a buffer you
need to call :c:func:`jbd2_journal_get_create_access()` /
:c:func:`jbd2_journal_get_write_access()` /
:c:func:`jbd2_journal_get_undo_access()` as appropriate; this allows the
journalling layer to copy the unmodified data if it needs to. After all,
the buffer may be part of a previously uncommitted transaction. At this
point you are at last ready to modify a buffer, and once you have done so
you need to call :c:func:`jbd2_journal_dirty_metadata`. Or, if you've asked
for access to a buffer that you now know is no longer required to be pushed
back to the device, you can call :c:func:`jbd2_journal_forget` in much the
same way as you might have used :c:func:`bforget` in the past.
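
The ordering of calls around a buffer modification can be sketched with a toy userspace model. None of this is jbd2 code: the ``toy_`` names are hypothetical, and unlike the real layer, which copies the unmodified data only if it needs to, this model always takes a copy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum toy_buf_state { TOY_CLEAN, TOY_WRITABLE, TOY_DIRTY };

/* Toy stand-in for a journalled buffer. */
struct toy_buffer {
	char data[16];
	char saved[16];		/* copy of the unmodified data */
	enum toy_buf_state state;
};

/* Must be called before the buffer is modified. */
int toy_get_write_access(struct toy_buffer *b)
{
	assert(b != NULL);
	memcpy(b->saved, b->data, sizeof(b->data));
	b->state = TOY_WRITABLE;
	return 0;
}

/* Modify the buffer; refused unless write access was requested first. */
int toy_modify(struct toy_buffer *b, const char *text)
{
	assert(b && text);
	if (b->state != TOY_WRITABLE)
		return -1;
	snprintf(b->data, sizeof(b->data), "%s", text);
	return 0;
}

/* Tell the (toy) journal that the modification is finished. */
int toy_dirty_metadata(struct toy_buffer *b)
{
	assert(b != NULL);
	if (b->state != TOY_WRITABLE)
		return -1;
	b->state = TOY_DIRTY;
	return 0;
}
```

The point of the model is the sequence: request access, modify, then mark dirty. A modification attempted outside that window is rejected.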
A :c:func:`jbd2_journal_flush` may be called at any time to commit and
checkpoint all your transactions.
Then at umount time, in your :c:func:`put_super`, you can call
:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.
Unfortunately there are a couple of ways the journal layer can cause a
deadlock. The first thing to note is that each task can only have a
single outstanding transaction at any one time; remember that nothing
commits until the outermost :c:func:`jbd2_journal_stop`. This means you must
complete the transaction at the end of each file/inode/address etc.
operation you perform, so that the journalling system isn't re-entered on
another journal. This matters because transactions can't be nested/batched
across differing journals, and a filesystem other than yours (say ext4)
may be modified in a later syscall.
The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
if there isn't enough space in the journal for your transaction (based
on the passed nblocks param) - when it blocks it merely(!) needs to wait
for transactions to complete and be committed from other tasks, so
essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid
deadlocks you must treat :c:func:`jbd2_journal_start` /
:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
your semaphore ordering rules to prevent
deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking
behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as
easily as on :c:func:`jbd2_journal_start`.
Try to reserve the right number of blocks the first time ;-). This will
be the maximum number of blocks you are going to touch in this
transaction. I advise having a look at ext4_jbd.h, at least, to see the
basis on which ext4 makes these decisions.
Another wriggle to watch out for is your on-disk block allocation
strategy. Why? Because, if you do a delete, you need to ensure you
haven't reused any of the freed blocks until the transaction freeing
these blocks commits. If you reuse these blocks and a crash happens,
there is no way to restore the contents of the reallocated blocks at the
end of the last fully committed transaction. One simple way of ensuring
this is to mark blocks as free in internal in-memory block allocation
structures only after the transaction freeing them commits. Ext4 uses
a journal commit callback for this purpose.
With journal commit callbacks you can ask the journalling layer to call
a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer to call the callback by simply setting the
``journal->j_commit_callback`` function pointer; that function is then
called after each transaction commit. You can also use
``transaction->t_private_list`` for attaching entries to a transaction
that need processing when the transaction commits.
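
The deferred-free idea can be modelled in userspace as follows. This is a toy sketch and not how ext4 or jbd2 implement it: all ``toy_`` names are hypothetical, the per-transaction array stands in for entries attached to the transaction, and the callback pointer stands in for the commit callback described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBLOCKS	8
#define TOY_MAX_PENDING	8

/* Blocks freed by a transaction, remembered until it commits. */
struct toy_transaction {
	unsigned int pending[TOY_MAX_PENDING];
	size_t npending;
};

struct toy_journal {
	bool in_use[TOY_NBLOCKS];	/* in-memory allocation state */
	void (*commit_callback)(struct toy_journal *,
				struct toy_transaction *);
};

/* Free a block: only record it, so it cannot be reused before commit. */
int toy_free_block(struct toy_transaction *t, unsigned int blk)
{
	if (blk >= TOY_NBLOCKS || t->npending == TOY_MAX_PENDING)
		return -1;
	t->pending[t->npending++] = blk;
	return 0;
}

/* Commit callback: now the freed blocks may really be released. */
void toy_release_blocks(struct toy_journal *j, struct toy_transaction *t)
{
	for (size_t i = 0; i < t->npending; i++) {
		assert(t->pending[i] < TOY_NBLOCKS);
		j->in_use[t->pending[i]] = false;
	}
	t->npending = 0;
}

/* Allocate the first free block, or return -1 if there is none. */
int toy_alloc_block(struct toy_journal *j)
{
	for (unsigned int i = 0; i < TOY_NBLOCKS; i++) {
		if (!j->in_use[i]) {
			j->in_use[i] = true;
			return (int)i;
		}
	}
	return -1;
}

/* Toy commit: once the transaction is "on disk", run the callback. */
void toy_commit(struct toy_journal *j, struct toy_transaction *t)
{
	if (j->commit_callback)
		j->commit_callback(j, t);
}
```

In this model a block passed to ``toy_free_block()`` stays unavailable to ``toy_alloc_block()`` until ``toy_commit()`` has run, which is the property the paragraph above asks for.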
JBD2 also provides a way to block all transaction updates via
:c:func:`jbd2_journal_lock_updates()` /
:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
window with a clean and stable fs for a moment. E.g.

::

    jbd2_journal_lock_updates()   // stop new stuff happening..
    jbd2_journal_flush()          // checkpoint everything.
    ..do stuff on stable fs
    jbd2_journal_unlock_updates() // carry on with filesystem use.

The opportunities for abuse and DoS attacks with this should be obvious,
if you allow unprivileged userspace to trigger codepaths containing
these calls.
Summary
~~~~~~~
Using the journal is a matter of wrapping the different context changes,
being each mount, each modification (transaction) and each changed
buffer to tell the journalling layer about them.
Data Types
----------
The journalling layer uses typedefs to 'hide' the concrete definitions
of the structures used. As a client of the JBD2 layer you can just rely
on using the pointer as a magic cookie of some sort. Obviously the
hiding is not enforced as this is 'C'.
Structures
~~~~~~~~~~
.. kernel-doc:: include/linux/jbd2.h
:internal:
Functions
---------
The functions here are split into two groups: those that affect a journal
as a whole, and those which are used to manage transactions.
Journal Level
~~~~~~~~~~~~~
.. kernel-doc:: fs/jbd2/journal.c
:export:
.. kernel-doc:: fs/jbd2/recovery.c
:internal:
Transaction Level
~~~~~~~~~~~~~~~~~
.. kernel-doc:: fs/jbd2/transaction.c
See also
--------
`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__
`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__

@@ -1,3 +1,18 @@
===============
Pathname lookup
===============
This write-up is based on three articles published at lwn.net:
- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
- <https://lwn.net/Articles/650786/> A walk among the symlinks
Written by Neil Brown with help from Al Viro and Jon Corbet.
It has subsequently been updated to reflect changes in the kernel
including:
- per-directory parallel name lookup.
Introduction to pathname lookup
===============================
@@ -344,7 +359,7 @@ In particular it is held while scanning chains in the dcache hash
table, and the mount point hash table.
Bringing it together with ``struct nameidata``
--------------------------------------------
----------------------------------------------
.. _First edition Unix: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
@@ -355,7 +370,7 @@ converts a "name" to an "inode". ``struct nameidata`` contains (among
other fields):
``struct path path``
~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~
A ``path`` contains a ``struct vfsmount`` (which is
embedded in a ``struct mount``) and a ``struct dentry``. Together these
@@ -366,13 +381,13 @@ step. A reference through ``d_lockref`` and ``mnt_count`` is always
held.
``struct qstr last``
~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~
This is a string together with a length (i.e. *not* ``nul`` terminated)
that is the "next" component in the pathname.
``int last_type``
~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~
This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT``, ``LAST_DOTDOT``, or
``LAST_BIND``. The ``last`` field is only valid if the type is
@@ -381,7 +396,7 @@ components of the symlink have been processed yet. Others should be
fairly self-explanatory.
``struct path root``
~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~
This is used to hold a reference to the effective root of the
filesystem. Often that reference won't be needed, so this field is
@@ -510,7 +525,7 @@ potentially interesting things about these dentries corresponding
to three different flags that might be set in ``dentry->d_flags``:
``DCACHE_MANAGE_TRANSIT``
~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~
If this flag has been set, then the filesystem has requested that the
``d_manage()`` dentry operation be called before handling any possible
@@ -529,7 +544,7 @@ filesystem, which will then give it a special pass through
``d_manage()`` by returning ``-EISDIR``.
``DCACHE_MOUNTED``
~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~
This flag is set on every dentry that is mounted on. As Linux
supports multiple filesystem namespaces, it is possible that the
@@ -542,7 +557,7 @@ If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
and a new ``dentry`` (both with counted references).
``DCACHE_NEED_AUTOMOUNT``
~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~
If ``d_manage()`` allowed us to get this far, and ``lookup_mnt()`` didn't
find a mount point, then this flag causes the ``d_automount()`` dentry
@@ -698,7 +713,7 @@ With that little refresher on seqlocks out of the way we can look at
the bigger picture of how RCU-walk uses seqlocks.
``mount_lock`` and ``nd->m_seq``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We already met the ``mount_lock`` seqlock when REF-walk used it to
ensure that crossing a mount point is performed safely. RCU-walk uses
@@ -727,7 +742,7 @@ results would have been the same. This ensures the invariant holds,
at least for vfsmount structures.
``dentry->d_seq`` and ``nd->seq``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In place of taking a count or lock on ``d_lockref``, RCU-walk samples
the per-dentry ``d_seq`` seqlock, and stores the sequence number in the
@@ -774,7 +789,7 @@ getting a counted reference to the new dentry before dropping that for
the old dentry which we saw in REF-walk.
No ``inode->i_rwsem`` or even ``rename_lock``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A semaphore is a fairly heavyweight lock that can only be taken when it is
permissible to sleep. As ``rcu_read_lock()`` forbids sleeping,
@@ -796,7 +811,7 @@ locking. This neatly handles all cases, so adding extra checks on
rename_lock would bring no significant value.
``unlazy_walk()`` and ``complete_walk()``
-------------------------------------
-----------------------------------------
That "dropping down to REF-walk" typically involves a call to
``unlazy_walk()``, so named because "RCU-walk" is also sometimes

@@ -0,0 +1,22 @@
================
splice and pipes
================
splice API
==========
splice is a method for moving blocks of data around inside the kernel,
without continually transferring them between the kernel and user space.
.. kernel-doc:: fs/splice.c
pipes API
=========
Pipe interfaces are all for in-kernel (builtin image) use. They are not
exported for use by modules.
.. kernel-doc:: include/linux/pipe_fs_i.h
:internal:
.. kernel-doc:: fs/pipe.c

@@ -116,6 +116,27 @@ static struct device_attribute dev_attr_foo = {
.store = store_foo,
};
Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally
considered a bad idea.", so trying to make a sysfs file writable for
everyone will fail, reverting to RO mode for "Others".

For the common cases sysfs.h provides convenience macros to make
defining attributes easier as well as making code more concise and
readable. The above case could be shortened to:

static struct device_attribute dev_attr_foo = __ATTR_RW(foo);

The list of helpers available to define your wrapper function is:

__ATTR_RO(name): assumes default name_show and mode 0444
__ATTR_WO(name): assumes a name_store only and is restricted to mode
                 0200, that is, root write access only.
__ATTR_RO_MODE(name, mode): for more restrictive RO access; currently
                 the only use case is the EFI System Resource Table
                 (see drivers/firmware/efi/esrt.c)
__ATTR_RW(name): assumes default name_show, name_store and setting
                 mode to 0644.
__ATTR_NULL: which sets the name to NULL and is used as end of list
                 indicator (see: kernel/workqueue.c)

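
To make the naming convention behind these helpers concrete, here is a userspace sketch. It does not use the kernel headers: the ``toy_`` structures and ``TOY_ATTR*`` macros are simplified, hypothetical stand-ins, and the callback signatures are deliberately shorter than the real show/store prototypes. It only illustrates how a macro can derive the attribute name, the mode, and the ``name_show``/``name_store`` function names from a single argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for the kernel structures (assumption). */
struct toy_attribute {
	const char *name;
	unsigned short mode;
};

struct toy_device_attribute {
	struct toy_attribute attr;
	long (*show)(char *buf);
	long (*store)(const char *buf, size_t count);
};

/* Generic initializer: stringify the name, take mode and callbacks. */
#define TOY_ATTR(_name, _mode, _show, _store) {			\
	.attr = { .name = #_name, .mode = (_mode) },		\
	.show = (_show),					\
	.store = (_store),					\
}

/* Helpers that paste "_show" / "_store" onto the attribute name. */
#define TOY_ATTR_RO(_name) TOY_ATTR(_name, 0444, _name##_show, NULL)
#define TOY_ATTR_WO(_name) TOY_ATTR(_name, 0200, NULL, _name##_store)
#define TOY_ATTR_RW(_name) TOY_ATTR(_name, 0644, _name##_show, _name##_store)

/* Callbacks whose names follow the convention the macros rely on. */
long foo_show(char *buf)
{
	assert(buf != NULL);
	strcpy(buf, "42\n");
	return 3;
}

long foo_store(const char *buf, size_t count)
{
	(void)buf;		/* a real handler would parse buf here */
	return (long)count;
}

/* One line replaces the explicit initializer. */
struct toy_device_attribute dev_attr_foo = TOY_ATTR_RW(foo);
```

The design point is that the callbacks must be named after the attribute (``foo_show``, ``foo_store``) for the token pasting to resolve.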
Subsystem-Specific Callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~