Merge branch master from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Pick up the spectre documentation so the Grand Schemozzle can be added.
This commit is contained in:
Thomas Gleixner
2019-07-28 22:22:40 +02:00
12594 changed files with 1005238 additions and 433120 deletions

View File

@@ -19,3 +19,13 @@ block device backing the filesystem is not read-only, a sysctl is
created to toggle pinning: ``/proc/sys/kernel/loadpin/enabled``. (Having
a mutable filesystem means pinning is mutable too, but having the
sysctl allows for easy testing on systems with a mutable filesystem.)
It's also possible to exclude specific file types from LoadPin using the
kernel command line option "``loadpin.exclude``". By default, all file
types are included, but they can be excluded with a kernel command line
option such as "``loadpin.exclude=kernel-module,kexec-image``". This allows
different mechanisms such as ``CONFIG_MODULE_SIG`` and
``CONFIG_KEXEC_VERIFY_SIG`` to verify kernel modules and the kernel image
while LoadPin still protects the integrity of the other files the kernel
loads. The full list of valid file types can be found in
``kernel_read_file_str``, defined in ``include/linux/fs.h``.
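As a quick sketch (assuming a mutable filesystem, so the sysctl above
exists), pinning can be toggled at runtime and exclusions set at boot::
    # check, then temporarily disable pinning for testing
    cat /proc/sys/kernel/loadpin/enabled
    echo 0 > /proc/sys/kernel/loadpin/enabled
    # on the kernel command line, exempt modules and kexec images:
    #   loadpin.exclude=kernel-module,kexec-image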

View File

@@ -227,7 +227,7 @@ Configuring the kernel
"make tinyconfig" Configure the tiniest possible kernel.
You can find more information on using the Linux kernel config tools
in Documentation/kbuild/kconfig.txt.
in Documentation/kbuild/kconfig.rst.
- NOTES on ``make config``:

View File

@@ -0,0 +1,150 @@
Introduction
============
ATA over Ethernet is a network protocol that provides simple access to
block storage on the LAN.
http://support.coraid.com/documents/AoEr11.txt
The EtherDrive (R) HOWTO for 2.6 and 3.x kernels is found at ...
http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO.html
It has many tips and hints! Please see, especially, recommended
tunings for virtual memory:
http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO-5.html#ss5.19
The aoetools are userland programs that are designed to work with this
driver. The aoetools are on sourceforge.
http://aoetools.sourceforge.net/
The scripts in this Documentation/admin-guide/aoe directory are intended to
document the use of the driver and are not necessary if you install
the aoetools.
Creating Device Nodes
=====================
Users of udev should find the block device nodes created
automatically, but to create all the necessary device nodes, use the
udev configuration rules provided in udev.txt (in this directory).
There is a udev-install.sh script that shows how to install these
rules on your system.
There is also an autoload script that shows how to edit
/etc/modprobe.d/aoe.conf to ensure that the aoe module is loaded when
necessary. Preloading the aoe module is preferable to autoloading,
however, because AoE discovery takes a few seconds. It can be
confusing when an AoE device is not present the first time a
command is run but appears a second later.
Using Device Nodes
==================
"cat /dev/etherd/err" blocks, waiting for error diagnostic output,
like any retransmitted packets.
"echo eth2 eth4 > /dev/etherd/interfaces" tells the aoe driver to
limit ATA over Ethernet traffic to eth2 and eth4. AoE traffic from
untrusted networks should be ignored as a matter of security. See
also the aoe_iflist driver option described below.
"echo > /dev/etherd/discover" tells the driver to find out what AoE
devices are available.
In the future these character devices may disappear and be replaced
by sysfs counterparts. Using the commands in aoetools insulates
users from these implementation details.
The block devices are named like this::
e{shelf}.{slot}
e{shelf}.{slot}p{part}
... so that "e0.2" is the third blade from the left (slot 2) in the
first shelf (shelf address zero). That's the whole disk. The first
partition on that disk would be "e0.2p1".
Using sysfs
===========
Each aoe block device in /sys/block has the extra attributes of
state, mac, and netif. The state attribute is "up" when the device
is ready for I/O and "down" if detected but unusable. The
"down,closewait" state shows that the device is still open and
cannot come up again until it has been closed.
The mac attribute is the ethernet address of the remote AoE device.
The netif attribute is the network interface on the localhost
through which we are communicating with the remote AoE device.
There is a script in this directory that formats this information in
a convenient way. Users with aoetools should use the aoe-stat
command::
root@makki root# sh Documentation/admin-guide/aoe/status.sh
e10.0 eth3 up
e10.1 eth3 up
e10.2 eth3 up
e10.3 eth3 up
e10.4 eth3 up
e10.5 eth3 up
e10.6 eth3 up
e10.7 eth3 up
e10.8 eth3 up
e10.9 eth3 up
e4.0 eth1 up
e4.1 eth1 up
e4.2 eth1 up
e4.3 eth1 up
e4.4 eth1 up
e4.5 eth1 up
e4.6 eth1 up
e4.7 eth1 up
e4.8 eth1 up
e4.9 eth1 up
Use /sys/module/aoe/parameters/aoe_iflist (or better, the driver
option discussed below) instead of /dev/etherd/interfaces to limit
AoE traffic to the network interfaces in the given
whitespace-separated list. Unlike the old character device, the
sysfs entry can be read from as well as written to.
It's helpful to trigger discovery after setting the list of allowed
interfaces. The aoetools package provides an aoe-discover script
for this purpose. You can also directly use the
/dev/etherd/discover special file described above.
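A minimal sketch of that sequence (assuming the aoe module is loaded and
sysfs is mounted at /sys)::
    # limit AoE traffic to eth1 and eth3, then trigger discovery
    echo "eth1 eth3" > /sys/module/aoe/parameters/aoe_iflist
    cat /sys/module/aoe/parameters/aoe_iflist
    echo > /dev/etherd/discover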
Driver Options
==============
There is a boot option for the built-in aoe driver and a
corresponding module parameter, aoe_iflist. Without this option,
all network interfaces may be used for ATA over Ethernet. Here is a
usage example for the module parameter::
modprobe aoe aoe_iflist="eth1 eth3"
The aoe_deadsecs module parameter determines the maximum number of
seconds that the driver will wait for an AoE device to provide a
response to an AoE command. After aoe_deadsecs seconds have
elapsed, the AoE device will be marked as "down". A value of zero
is supported for testing purposes and makes the aoe driver keep
trying AoE commands forever.
The aoe_maxout module parameter has a default of 128. This is the
maximum number of outstanding (not yet answered) packets that will
be sent to an AoE target at one time.
The aoe_dyndevs module parameter defaults to 1, meaning that the
driver will assign a block device minor number to a discovered AoE
target based on the order of its discovery. With dynamic minor
device numbers in use, a greater range of AoE shelf and slot
addresses can be supported. Users with udev will never have to
think about minor numbers. Using aoe_dyndevs=0 allows device nodes
to be pre-created using a static minor-number scheme with the
aoe-mkshelf script in the aoetools.
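One way to make these settings persistent is the /etc/modprobe.d/aoe.conf
file mentioned above (a sketch; the values shown are examples only)::
    # /etc/modprobe.d/aoe.conf
    options aoe aoe_iflist="eth1 eth3" aoe_deadsecs=20 aoe_dyndevs=0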

View File

@@ -0,0 +1,17 @@
#!/bin/sh
# set aoe to autoload by installing the
# aliases in /etc/modprobe.d/
f=/etc/modprobe.d/aoe.conf
if test ! -r $f || test ! -w $f; then
echo "cannot configure $f for module autoloading" 1>&2
exit 1
fi
grep major-152 $f >/dev/null
if [ $? = 1 ]; then
echo alias block-major-152 aoe >> $f
echo alias char-major-152 aoe >> $f
fi

View File

@@ -0,0 +1,23 @@
Example of udev rules
---------------------
.. include:: udev.txt
:literal:
Example of udev install rules script
------------------------------------
.. literalinclude:: udev-install.sh
:language: shell
Example script to get status
----------------------------
.. literalinclude:: status.sh
:language: shell
Example of AoE autoload script
------------------------------
.. literalinclude:: autoload.sh
:language: shell

View File

@@ -0,0 +1,17 @@
=======================
ATA over Ethernet (AoE)
=======================
.. toctree::
:maxdepth: 1
aoe
todo
examples
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

View File

@@ -0,0 +1,30 @@
#! /bin/sh
# collate and present sysfs information about AoE storage
#
# A more complete version of this script is aoe-stat, in the
# aoetools.
set -e
format="%8s\t%8s\t%8s\n"
me=`basename $0`
sysd=${sysfs_dir:-/sys}
# printf "$format" device mac netif state
# Suse 9.1 Pro doesn't put /sys in /etc/mtab
#test -z "`mount | grep sysfs`" && {
test ! -d "$sysd/block" && {
echo "$me Error: sysfs is not mounted" 1>&2
exit 1
}
for d in `ls -d $sysd/block/etherd* 2>/dev/null | grep -v p` end; do
# maybe ls comes up empty, so we use "end"
test $d = end && continue
dev=`echo "$d" | sed 's/.*!//'`
printf "$format" \
"$dev" \
"`cat \"$d/netif\"`" \
"`cat \"$d/state\"`"
done | sort

View File

@@ -0,0 +1,17 @@
TODO
====
There is a potential for deadlock when allocating a struct sk_buff for
data that needs to be written out to aoe storage. If the data is
being written from a dirty page in order to free that page, and if
there are no other pages available, then deadlock may occur when a
free page is needed for the sk_buff allocation. This situation has
not been observed, but it would be nice to eliminate any potential for
deadlock under memory pressure.
Because ATA over Ethernet is not fragmented by the kernel's IP code,
the destructor member of the struct sk_buff is available to the aoe
driver. By using a mempool for allocating all but the first few
sk_buffs, and by registering a destructor, we should be able to
efficiently allocate sk_buffs without introducing any potential for
deadlock.

View File

@@ -0,0 +1,33 @@
# install the aoe-specific udev rules from udev.txt into
# the system's udev configuration
#
me="`basename $0`"
# find udev.conf, often /etc/udev/udev.conf
# (or environment can specify where to find udev.conf)
#
if test -z "$conf"; then
if test -r /etc/udev/udev.conf; then
conf=/etc/udev/udev.conf
else
conf="`find /etc -type f -name udev.conf 2> /dev/null`"
if test -z "$conf" || test ! -r "$conf"; then
echo "$me Error: no udev.conf found" 1>&2
exit 1
fi
fi
fi
# find the directory where udev rules are stored, often
# /etc/udev/rules.d
#
rules_d="`sed -n '/^udev_rules=/{ s!udev_rules=!!; s!\"!!g; p; }' $conf`"
if test -z "$rules_d" ; then
rules_d=/etc/udev/rules.d
fi
if test ! -d "$rules_d"; then
echo "$me Error: cannot find udev rules directory" 1>&2
exit 1
fi
sh -xc "cp `dirname $0`/udev.txt $rules_d/60-aoe.rules"

View File

@@ -0,0 +1,26 @@
# These rules tell udev what device nodes to create for aoe support.
# They may be installed along the following lines. Check the section
# 8 udev manpage to see whether your udev supports SUBSYSTEM, and
# whether it uses one or two equal signs for SUBSYSTEM and KERNEL.
#
# ecashin@makki ~$ su
# Password:
# bash# find /etc -type f -name udev.conf
# /etc/udev/udev.conf
# bash# grep udev_rules= /etc/udev/udev.conf
# udev_rules="/etc/udev/rules.d/"
# bash# ls /etc/udev/rules.d/
# 10-wacom.rules 50-udev.rules
# bash# cp /path/to/linux/Documentation/admin-guide/aoe/udev.txt \
# /etc/udev/rules.d/60-aoe.rules
#
# aoe char devices
SUBSYSTEM=="aoe", KERNEL=="discover", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="err", NAME="etherd/%k", GROUP="disk", MODE="0440"
SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="flush", NAME="etherd/%k", GROUP="disk", MODE="0220"
# aoe block devices
KERNEL=="etherd*", GROUP="disk"

View File

@@ -0,0 +1,68 @@
.. SPDX-License-Identifier: GPL-2.0
The Android binderfs Filesystem
===============================
Android binderfs is a filesystem for the Android binder IPC mechanism. It
allows binder devices to be added and removed dynamically at runtime. Binder devices
located in a new binderfs instance are independent of binder devices located in
other binderfs instances. Mounting a new binderfs instance makes it possible
to get a set of private binder devices.
Mounting binderfs
-----------------
Android binderfs can be mounted with::
mkdir /dev/binderfs
mount -t binder binder /dev/binderfs
at which point a new instance of binderfs will show up at ``/dev/binderfs``.
In a fresh instance of binderfs no binder devices will be present. There will
only be a ``binder-control`` device which serves as the request handler for
binderfs. Mounting another binderfs instance at a different location will
create a new and separate instance from all other binderfs mounts. This is
identical to the behavior of e.g. ``devpts`` and ``tmpfs``. The Android
binderfs filesystem can be mounted in user namespaces.
Options
-------
max
binderfs instances can be mounted with a limit on the number of binder
devices that can be allocated. The ``max=<count>`` mount option serves as
a per-instance limit. If ``max=<count>`` is set then only ``<count>`` binder
devices can be allocated in this binderfs instance.
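A sketch of mounting an instance with a device limit (the mount point and
the count here are arbitrary examples)::
    mkdir /dev/binderfs
    mount -t binder -o max=4 binder /dev/binderfs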
Allocating binder Devices
-------------------------
.. _ioctl: http://man7.org/linux/man-pages/man2/ioctl.2.html
To allocate a new binder device in a binderfs instance a request needs to be
sent through the ``binder-control`` device node. A request is sent in the form
of an `ioctl() <ioctl_>`_.
What a program needs to do is to open the ``binder-control`` device node and
send a ``BINDER_CTL_ADD`` request to the kernel. Users of binderfs need to
tell the kernel which name the new binder device should get. By default a name
can only contain up to ``BINDERFS_MAX_NAME`` chars including the terminating
zero byte.
Once the request is made via an `ioctl() <ioctl_>`_ passing a ``struct
binder_device`` with the name to the kernel, it will allocate a new binder
device and return the major and minor number of the new device in the struct
(this is necessary because binderfs allocates a major device number
dynamically). After the `ioctl() <ioctl_>`_ returns, there will be a new
binder device located under /dev/binderfs with the chosen name.
Deleting binder Devices
-----------------------
.. _unlink: http://man7.org/linux/man-pages/man2/unlink.2.html
.. _rm: http://man7.org/linux/man-pages/man1/rm.1.html
Binderfs binder devices can be deleted via `unlink() <unlink_>`_. This means
that the `rm <rm_>`_ tool can be used to delete them. Note that the
``binder-control`` device cannot be deleted since this would make the binderfs
instance unusable. The ``binder-control`` device will be deleted when the
binderfs instance is unmounted and all references to it have been dropped.
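For example (``my-device`` stands in for whatever name was chosen when the
device was allocated)::
    rm /dev/binderfs/my-device    # binder-control itself cannot be removed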

View File

@@ -0,0 +1,588 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Created with Inkscape (http://www.inkscape.org/) -->
<svg
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
version="1.0"
width="210mm"
height="297mm"
viewBox="0 0 21000 29700"
id="svg2"
style="fill-rule:evenodd">
<defs
id="defs4" />
<g
id="Default"
style="visibility:visible">
<desc
id="desc180">Master slide</desc>
</g>
<path
d="M 11999,8601 L 11899,8301 L 12099,8301 L 11999,8601 z"
id="path193"
style="fill:#008000;visibility:visible" />
<path
d="M 11999,7801 L 11999,8361"
id="path197"
style="fill:none;stroke:#008000;visibility:visible" />
<path
d="M 7999,10401 L 7899,10101 L 8099,10101 L 7999,10401 z"
id="path209"
style="fill:#008000;visibility:visible" />
<path
d="M 7999,9601 L 7999,10161"
id="path213"
style="fill:none;stroke:#008000;visibility:visible" />
<path
d="M 11999,7801 L 11685,7840 L 11724,7644 L 11999,7801 z"
id="path225"
style="fill:#008000;visibility:visible" />
<path
d="M 7999,7001 L 11764,7754"
id="path229"
style="fill:none;stroke:#008000;visibility:visible" />
<g
transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-1244.4792,1416.5139)"
id="g245"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<text
id="text247">
<tspan
x="9139 9368 9579 9808 9986 10075 10252 10481 10659 10837 10909"
y="9284"
id="tspan249">RSDataReply</tspan>
</text>
</g>
<path
d="M 7999,9601 L 8281,9458 L 8311,9655 L 7999,9601 z"
id="path259"
style="fill:#008000;visibility:visible" />
<path
d="M 11999,9001 L 8236,9565"
id="path263"
style="fill:none;stroke:#008000;visibility:visible" />
<g
transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,1620.9382,-1639.4947)"
id="g279"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<text
id="text281">
<tspan
x="8743 8972 9132 9310 9573 9801 10013 10242 10419 10597 10775 10953 11114"
y="7023"
id="tspan283">CsumRSRequest</tspan>
</text>
</g>
<text
id="text297"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
y="5707"
id="tspan299">w_make_resync_request()</tspan>
</text>
<text
id="text313"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
y="7806"
id="tspan315">receive_DataRequest()</tspan>
</text>
<text
id="text329"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
y="8606"
id="tspan331">drbd_endio_read_sec()</tspan>
</text>
<text
id="text345"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13825 13986 14164 14426 14604 14710 14871 15049 15154 15332 15510 15616"
y="9007"
id="tspan347">w_e_end_csum_rs_req()</tspan>
</text>
<text
id="text361"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4444 4550 4728 4889 5066 5138 5299 5477 5655 5883 6095 6324 6501 6590 6768 6997 7175 7352 7424 7585 7691"
y="9507"
id="tspan363">receive_RSDataReply()</tspan>
</text>
<text
id="text377"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4457 4635 4741 4918 5096 5274 5452 5630 5807 5879 6057 6235 6464 6569 6641 6730 6908 7086 7247 7425 7585 7691"
y="10407"
id="tspan379">drbd_endio_write_sec()</tspan>
</text>
<text
id="text393"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4647 4825 5003 5180 5358 5536 5714 5820 5997 6158 6319 6497 6658 6836 7013 7085 7263 7424 7585 7691"
y="10907"
id="tspan395">e_end_resync_block()</tspan>
</text>
<path
d="M 11999,11601 L 11685,11640 L 11724,11444 L 11999,11601 z"
id="path405"
style="fill:#000080;visibility:visible" />
<path
d="M 7999,10801 L 11764,11554"
id="path409"
style="fill:none;stroke:#000080;visibility:visible" />
<g
transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,2434.7562,-1674.649)"
id="g425"
style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
<text
id="text427">
<tspan
x="9320 9621 9726 9798 9887 10065 10277 10438"
y="10943"
id="tspan429">WriteAck</tspan>
</text>
</g>
<text
id="text443"
style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12199 12377 12555 12644 12821 13033 13105 13283 13444 13604 13816 13977 14138 14244"
y="11559"
id="tspan445">got_BlockAck()</tspan>
</text>
<text
id="text459"
style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="7999 8304 8541 8778 8990 9201 9413 9650 10001 10120 10357 10594 10806 11043 11280 11398 11703 11940 12152 12364 12601 12812 12931 13049 13261 13498 13710 13947 14065 14302 14540 14658 14777 14870 15107 15225 15437 15649 15886"
y="4877"
id="tspan461">Checksum based Resync, case not in sync</tspan>
</text>
<text
id="text475"
style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="6961 7266 7571 7854 8159 8299 8536 8654 8891 9010 9247 9484 9603 9840 9958 10077 10170 10407"
y="2806"
id="tspan477">DRBD-8.3 data flow</tspan>
</text>
<text
id="text491"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="5190 5419 5596 5774 5952 6113 6291 6468 6646 6824 6985 7146 7324 7586 7692"
y="7005"
id="tspan493">w_e_send_csum()</tspan>
</text>
<path
d="M 11999,17601 L 11899,17301 L 12099,17301 L 11999,17601 z"
id="path503"
style="fill:#008000;visibility:visible" />
<path
d="M 11999,16801 L 11999,17361"
id="path507"
style="fill:none;stroke:#008000;visibility:visible" />
<path
d="M 11999,16801 L 11685,16840 L 11724,16644 L 11999,16801 z"
id="path519"
style="fill:#008000;visibility:visible" />
<path
d="M 7999,16001 L 11764,16754"
id="path523"
style="fill:none;stroke:#008000;visibility:visible" />
<g
transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-2539.5806,1529.3491)"
id="g539"
style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
<text
id="text541">
<tspan
x="9269 9498 9709 9798 9959 10048 10226 10437 10598 10776"
y="18265"
id="tspan543">RSIsInSync</tspan>
</text>
</g>
<path
d="M 7999,18601 L 8281,18458 L 8311,18655 L 7999,18601 z"
id="path553"
style="fill:#000080;visibility:visible" />
<path
d="M 11999,18001 L 8236,18565"
id="path557"
style="fill:none;stroke:#000080;visibility:visible" />
<g
transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,3461.4027,-1449.3012)"
id="g573"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<text
id="text575">
<tspan
x="8743 8972 9132 9310 9573 9801 10013 10242 10419 10597 10775 10953 11114"
y="16023"
id="tspan577">CsumRSRequest</tspan>
</text>
</g>
<text
id="text591"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
y="16806"
id="tspan593">receive_DataRequest()</tspan>
</text>
<text
id="text607"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
y="17606"
id="tspan609">drbd_endio_read_sec()</tspan>
</text>
<text
id="text623"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13825 13986 14164 14426 14604 14710 14871 15049 15154 15332 15510 15616"
y="18007"
id="tspan625">w_e_end_csum_rs_req()</tspan>
</text>
<text
id="text639"
style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
<tspan
x="5735 5913 6091 6180 6357 6446 6607 6696 6874 7085 7246 7424 7585 7691"
y="18507"
id="tspan641">got_IsInSync()</tspan>
</text>
<text
id="text655"
style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="7999 8304 8541 8778 8990 9201 9413 9650 10001 10120 10357 10594 10806 11043 11280 11398 11703 11940 12152 12364 12601 12812 12931 13049 13261 13498 13710 13947 14065 14159 14396 14514 14726 14937 15175"
y="13877"
id="tspan657">Checksum based Resync, case in sync</tspan>
</text>
<path
d="M 12000,24601 L 11900,24301 L 12100,24301 L 12000,24601 z"
id="path667"
style="fill:#008000;visibility:visible" />
<path
d="M 12000,23801 L 12000,24361"
id="path671"
style="fill:none;stroke:#008000;visibility:visible" />
<path
d="M 8000,26401 L 7900,26101 L 8100,26101 L 8000,26401 z"
id="path683"
style="fill:#008000;visibility:visible" />
<path
d="M 8000,25601 L 8000,26161"
id="path687"
style="fill:none;stroke:#008000;visibility:visible" />
<path
d="M 12000,23801 L 11686,23840 L 11725,23644 L 12000,23801 z"
id="path699"
style="fill:#008000;visibility:visible" />
<path
d="M 8000,23001 L 11765,23754"
id="path703"
style="fill:none;stroke:#008000;visibility:visible" />
<g
transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-3543.8452,1630.5143)"
id="g719"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<text
id="text721">
<tspan
x="9464 9710 9921 10150 10328 10505 10577"
y="25236"
id="tspan723">OVReply</tspan>
</text>
</g>
<path
d="M 8000,25601 L 8282,25458 L 8312,25655 L 8000,25601 z"
id="path733"
style="fill:#008000;visibility:visible" />
<path
d="M 12000,25001 L 8237,25565"
id="path737"
style="fill:none;stroke:#008000;visibility:visible" />
<g
transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,4918.2801,-1381.2128)"
id="g753"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<text
id="text755">
<tspan
x="9142 9388 9599 9828 10006 10183 10361 10539 10700"
y="23106"
id="tspan757">OVRequest</tspan>
</text>
</g>
<text
id="text771"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13656 13868 14097 14274 14452 14630 14808 14969 15058 15163"
y="23806"
id="tspan773">receive_OVRequest()</tspan>
</text>
<text
id="text787"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14084 14262 14439 14617 14795 14956 15134 15295 15400"
y="24606"
id="tspan789">drbd_endio_read_sec()</tspan>
</text>
<text
id="text803"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12192 12421 12598 12776 12954 13132 13310 13487 13665 13843 14004 14182 14288 14465 14643 14749"
y="25007"
id="tspan805">w_e_end_ov_req()</tspan>
</text>
<text
id="text819"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="5101 5207 5385 5546 5723 5795 5956 6134 6312 6557 6769 6998 7175 7353 7425 7586 7692"
y="25507"
id="tspan821">receive_OVReply()</tspan>
</text>
<text
id="text835"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
y="26407"
id="tspan837">drbd_endio_read_sec()</tspan>
</text>
<text
id="text851"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4902 5131 5308 5486 5664 5842 6020 6197 6375 6553 6714 6892 6998 7175 7353 7425 7586 7692"
y="26907"
id="tspan853">w_e_end_ov_reply()</tspan>
</text>
<path
d="M 12000,27601 L 11686,27640 L 11725,27444 L 12000,27601 z"
id="path863"
style="fill:#000080;visibility:visible" />
<path
d="M 8000,26801 L 11765,27554"
id="path867"
style="fill:none;stroke:#000080;visibility:visible" />
<g
transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,5704.1907,-1328.312)"
id="g883"
style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
<text
id="text885">
<tspan
x="9279 9525 9736 9965 10143 10303 10481 10553"
y="26935"
id="tspan887">OVResult</tspan>
</text>
</g>
<text
id="text901"
style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12200 12378 12556 12645 12822 13068 13280 13508 13686 13847 14025 14097 14185 14291"
y="27559"
id="tspan903">got_OVResult()</tspan>
</text>
<text
id="text917"
style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="8000 8330 8567 8660 8754 8991 9228 9346 9558 9795 9935 10028 10146"
y="21877"
id="tspan919">Online verify</tspan>
</text>
<text
id="text933"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4641 4870 5047 5310 5488 5649 5826 6004 6182 6343 6521 6626 6804 6982 7160 7338 7499 7587 7693"
y="23005"
id="tspan935">w_make_ov_request()</tspan>
</text>
<path
d="M 8000,6500 L 7900,6200 L 8100,6200 L 8000,6500 z"
id="path945"
style="fill:#008000;visibility:visible" />
<path
d="M 8000,5700 L 8000,6260"
id="path949"
style="fill:none;stroke:#008000;visibility:visible" />
<path
d="M 3900,5500 L 3700,5500 L 3700,11000 L 3900,11000"
id="path961"
style="fill:none;stroke:#000000;visibility:visible" />
<path
d="M 3900,14500 L 3700,14500 L 3700,18600 L 3900,18600"
id="path973"
style="fill:none;stroke:#000000;visibility:visible" />
<path
d="M 3900,22800 L 3700,22800 L 3700,26900 L 3900,26900"
id="path985"
style="fill:none;stroke:#000000;visibility:visible" />
<text
id="text1001"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
y="6506"
id="tspan1003">drbd_endio_read_sec()</tspan>
</text>
<text
id="text1017"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
y="14708"
id="tspan1019">w_make_resync_request()</tspan>
</text>
<text
id="text1033"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="5190 5419 5596 5774 5952 6113 6291 6468 6646 6824 6985 7146 7324 7586 7692"
y="16006"
id="tspan1035">w_e_send_csum()</tspan>
</text>
<path
d="M 8000,15501 L 7900,15201 L 8100,15201 L 8000,15501 z"
id="path1045"
style="fill:#008000;visibility:visible" />
<path
d="M 8000,14701 L 8000,15261"
id="path1049"
style="fill:none;stroke:#008000;visibility:visible" />
<text
id="text1065"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
y="15507"
id="tspan1067">drbd_endio_read_sec()</tspan>
</text>
<path
d="M 16100,9000 L 16300,9000 L 16300,7500 L 16100,7500"
id="path1077"
style="fill:none;stroke:#000000;visibility:visible" />
<path
d="M 16100,18000 L 16300,18000 L 16300,16500 L 16100,16500"
id="path1089"
style="fill:none;stroke:#000000;visibility:visible" />
<path
d="M 16100,25000 L 16300,25000 L 16300,23500 L 16100,23500"
id="path1101"
style="fill:none;stroke:#000000;visibility:visible" />
<text
id="text1117"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="2026 2132 2293 2471 2648 2826 3004 3076 3254 3431 3503 3681 3787"
y="5402"
id="tspan1119">rs_begin_io()</tspan>
</text>
<text
id="text1133"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="2027 2133 2294 2472 2649 2827 3005 3077 3255 3432 3504 3682 3788"
y="14402"
id="tspan1135">rs_begin_io()</tspan>
</text>
<text
id="text1149"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="2026 2132 2293 2471 2648 2826 3004 3076 3254 3431 3503 3681 3787"
y="22602"
id="tspan1151">rs_begin_io()</tspan>
</text>
<text
id="text1165"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="1426 1532 1693 1871 2031 2209 2472 2649 2721 2899 2988 3166 3344 3416 3593 3699"
y="11302"
id="tspan1167">rs_complete_io()</tspan>
</text>
<text
id="text1181"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="1526 1632 1793 1971 2131 2309 2572 2749 2821 2999 3088 3266 3444 3516 3693 3799"
y="18931"
id="tspan1183">rs_complete_io()</tspan>
</text>
<text
id="text1197"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="1526 1632 1793 1971 2131 2309 2572 2749 2821 2999 3088 3266 3444 3516 3693 3799"
y="27231"
id="tspan1199">rs_complete_io()</tspan>
</text>
<text
id="text1213"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="16126 16232 16393 16571 16748 16926 17104 17176 17354 17531 17603 17781 17887"
y="7402"
id="tspan1215">rs_begin_io()</tspan>
</text>
<text
id="text1229"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="16127 16233 16394 16572 16749 16927 17105 17177 17355 17532 17604 17782 17888"
y="16331"
id="tspan1231">rs_begin_io()</tspan>
</text>
<text
id="text1245"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="16127 16233 16394 16572 16749 16927 17105 17177 17355 17532 17604 17782 17888"
y="23302"
id="tspan1247">rs_begin_io()</tspan>
</text>
<text
id="text1261"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
y="9302"
id="tspan1263">rs_complete_io()</tspan>
</text>
<text
id="text1277"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
y="18331"
id="tspan1279">rs_complete_io()</tspan>
</text>
<text
id="text1293"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="16126 16232 16393 16571 16731 16909 17172 17349 17421 17599 17688 17866 18044 18116 18293 18399"
y="25302"
id="tspan1295">rs_complete_io()</tspan>
</text>
</svg>


View File

@@ -0,0 +1,459 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Created with Inkscape (http://www.inkscape.org/) -->
<svg
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
version="1.0"
width="210mm"
height="297mm"
viewBox="0 0 21000 29700"
id="svg2"
style="fill-rule:evenodd">
<defs
id="defs4" />
<g
id="Default"
style="visibility:visible">
<desc
id="desc176">Master slide</desc>
</g>
<path
d="M 11999,19601 L 11899,19301 L 12099,19301 L 11999,19601 z"
id="path189"
style="fill:#008000;visibility:visible" />
<path
d="M 11999,18801 L 11999,19361"
id="path193"
style="fill:none;stroke:#008000;visibility:visible" />
<path
d="M 7999,21401 L 7899,21101 L 8099,21101 L 7999,21401 z"
id="path205"
style="fill:#008000;visibility:visible" />
<path
d="M 7999,20601 L 7999,21161"
id="path209"
style="fill:none;stroke:#008000;visibility:visible" />
<path
d="M 11999,18801 L 11685,18840 L 11724,18644 L 11999,18801 z"
id="path221"
style="fill:#008000;visibility:visible" />
<path
d="M 7999,18001 L 11764,18754"
id="path225"
style="fill:none;stroke:#008000;visibility:visible" />
<text
x="-3023.845"
y="1106.8124"
transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
id="text243"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="6115.1553 6344.1553 6555.1553 6784.1553 6962.1553 7051.1553 7228.1553 7457.1553 7635.1553 7813.1553 7885.1553"
y="21390.812"
id="tspan245">RSDataReply</tspan>
</text>
<path
d="M 7999,20601 L 8281,20458 L 8311,20655 L 7999,20601 z"
id="path255"
style="fill:#008000;visibility:visible" />
<path
d="M 11999,20001 L 8236,20565"
id="path259"
style="fill:none;stroke:#008000;visibility:visible" />
<text
x="3502.5356"
y="-2184.6621"
transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
id="text277"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12321.536 12550.536 12761.536 12990.536 13168.536 13257.536 13434.536 13663.536 13841.536 14019.536 14196.536 14374.536 14535.536"
y="15854.338"
id="tspan279">RSDataRequest</tspan>
</text>
<text
id="text293"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
y="17807"
id="tspan295">w_make_resync_request()</tspan>
</text>
<text
id="text309"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
y="18806"
id="tspan311">receive_DataRequest()</tspan>
</text>
<text
id="text325"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
y="19606"
id="tspan327">drbd_endio_read_sec()</tspan>
</text>
<text
id="text341"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13770 13931 14109 14287 14375 14553 14731 14837 15015 15192 15298"
y="20007"
id="tspan343">w_e_end_rsdata_req()</tspan>
</text>
<text
id="text357"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4444 4550 4728 4889 5066 5138 5299 5477 5655 5883 6095 6324 6501 6590 6768 6997 7175 7352 7424 7585 7691"
y="20507"
id="tspan359">receive_RSDataReply()</tspan>
</text>
<text
id="text373"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4457 4635 4741 4918 5096 5274 5452 5630 5807 5879 6057 6235 6464 6569 6641 6730 6908 7086 7247 7425 7585 7691"
y="21407"
id="tspan375">drbd_endio_write_sec()</tspan>
</text>
<text
id="text389"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4647 4825 5003 5180 5358 5536 5714 5820 5997 6158 6319 6497 6658 6836 7013 7085 7263 7424 7585 7691"
y="21907"
id="tspan391">e_end_resync_block()</tspan>
</text>
<path
d="M 11999,22601 L 11685,22640 L 11724,22444 L 11999,22601 z"
id="path401"
style="fill:#000080;visibility:visible" />
<path
d="M 7999,21801 L 11764,22554"
id="path405"
style="fill:none;stroke:#000080;visibility:visible" />
<text
x="4290.3008"
y="-2369.6162"
transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
id="text423"
style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
<tspan
x="13610.301 13911.301 14016.301 14088.301 14177.301 14355.301 14567.301 14728.301"
y="19573.385"
id="tspan425">WriteAck</tspan>
</text>
<text
id="text439"
style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12199 12377 12555 12644 12821 13033 13105 13283 13444 13604 13816 13977 14138 14244"
y="22559"
id="tspan441">got_BlockAck()</tspan>
</text>
<text
id="text455"
style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="7999 8304 8541 8753 8964 9201 9413 9531 9769 9862 10099 10310 10522 10734 10852 10971 11208 11348 11585 11822"
y="16877"
id="tspan457">Resync blocks, 4-32K</tspan>
</text>
<path
d="M 12000,7601 L 11900,7301 L 12100,7301 L 12000,7601 z"
id="path467"
style="fill:#008000;visibility:visible" />
<path
d="M 12000,6801 L 12000,7361"
id="path471"
style="fill:none;stroke:#008000;visibility:visible" />
<path
d="M 12000,6801 L 11686,6840 L 11725,6644 L 12000,6801 z"
id="path483"
style="fill:#008000;visibility:visible" />
<path
d="M 8000,6001 L 11765,6754"
id="path487"
style="fill:none;stroke:#008000;visibility:visible" />
<text
x="-1288.1796"
y="1279.7666"
transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
id="text505"
style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
<tspan
x="8174.8208 8475.8203 8580.8203 8652.8203 8741.8203 8919.8203 9131.8203 9292.8203"
y="9516.7666"
id="tspan507">WriteAck</tspan>
</text>
<path
d="M 8000,8601 L 8282,8458 L 8312,8655 L 8000,8601 z"
id="path517"
style="fill:#000080;visibility:visible" />
<path
d="M 12000,8001 L 8237,8565"
id="path521"
style="fill:none;stroke:#000080;visibility:visible" />
<text
x="1065.6655"
y="-2097.7664"
transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
id="text539"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="10682.666 10911.666 11088.666 11177.666"
y="4107.2339"
id="tspan541">Data</tspan>
</text>
<text
id="text555"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4746 4924 5030 5207 5385 5563 5826 6003 6164 6342 6520 6626 6803 6981 7159 7337 7498 7587 7692"
y="5505"
id="tspan557">drbd_make_request()</tspan>
</text>
<text
id="text571"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13639 13817 13906 14084 14190"
y="6806"
id="tspan573">receive_Data()</tspan>
</text>
<text
id="text587"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14207 14312 14384 14473 14651 14829 14990 15168 15328 15434"
y="7606"
id="tspan589">drbd_endio_write_sec()</tspan>
</text>
<text
id="text603"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12192 12370 12548 12725 12903 13081 13259 13437 13509 13686 13847 14008 14114"
y="8007"
id="tspan605">e_end_block()</tspan>
</text>
<text
id="text619"
style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
<tspan
x="5647 5825 6003 6092 6269 6481 6553 6731 6892 7052 7264 7425 7586 7692"
y="8606"
id="tspan621">got_BlockAck()</tspan>
</text>
<text
id="text635"
style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="8000 8305 8542 8779 9016 9109 9346 9486 9604 9956 10049 10189 10328 10565 10705 10942 11179 11298 11603 11742 11835 11954 12191 12310 12428 12665 12902 13139 13279 13516 13753"
y="4877"
id="tspan637">Regular mirrored write, 512-32K</tspan>
</text>
<text
id="text651"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="5381 5610 5787 5948 6126 6304 6482 6659 6837 7015 7087 7265 7426 7587 7692"
y="6003"
id="tspan653">w_send_dblock()</tspan>
</text>
<path
d="M 8000,6800 L 7900,6500 L 8100,6500 L 8000,6800 z"
id="path663"
style="fill:#008000;visibility:visible" />
<path
d="M 8000,6000 L 8000,6560"
id="path667"
style="fill:none;stroke:#008000;visibility:visible" />
<text
id="text683"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4602 4780 4886 5063 5241 5419 5597 5775 5952 6024 6202 6380 6609 6714 6786 6875 7053 7231 7409 7515 7587 7692"
y="6905"
id="tspan685">drbd_endio_write_pri()</tspan>
</text>
<path
d="M 12000,13602 L 11900,13302 L 12100,13302 L 12000,13602 z"
id="path695"
style="fill:#008000;visibility:visible" />
<path
d="M 12000,12802 L 12000,13362"
id="path699"
style="fill:none;stroke:#008000;visibility:visible" />
<path
d="M 12000,12802 L 11686,12841 L 11725,12645 L 12000,12802 z"
id="path711"
style="fill:#008000;visibility:visible" />
<path
d="M 8000,12002 L 11765,12755"
id="path715"
style="fill:none;stroke:#008000;visibility:visible" />
<text
x="-2155.5266"
y="1201.5964"
transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
id="text733"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="7202.4736 7431.4736 7608.4736 7697.4736 7875.4736 8104.4736 8282.4736 8459.4736 8531.4736"
y="15454.597"
id="tspan735">DataReply</tspan>
</text>
<path
d="M 8000,14602 L 8282,14459 L 8312,14656 L 8000,14602 z"
id="path745"
style="fill:#008000;visibility:visible" />
<path
d="M 12000,14002 L 8237,14566"
id="path749"
style="fill:none;stroke:#008000;visibility:visible" />
<text
x="2280.3804"
y="-2103.2141"
transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
id="text767"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="11316.381 11545.381 11722.381 11811.381 11989.381 12218.381 12396.381 12573.381 12751.381 12929.381 13090.381"
y="9981.7861"
id="tspan769">DataRequest</tspan>
</text>
<text
id="text783"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="4746 4924 5030 5207 5385 5563 5826 6003 6164 6342 6520 6626 6803 6981 7159 7337 7498 7587 7692"
y="11506"
id="tspan785">drbd_make_request()</tspan>
</text>
<text
id="text799"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13639 13817 13906 14084 14312 14490 14668 14846 15024 15185 15273 15379"
y="12807"
id="tspan801">receive_DataRequest()</tspan>
</text>
<text
id="text815"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14084 14262 14439 14617 14795 14956 15134 15295 15400"
y="13607"
id="tspan817">drbd_endio_read_sec()</tspan>
</text>
<text
id="text831"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="12192 12421 12598 12776 12954 13132 13310 13487 13665 13843 14021 14110 14288 14465 14571 14749 14927 15033"
y="14008"
id="tspan833">w_e_end_data_req()</tspan>
</text>
<g
id="g835"
style="visibility:visible">
<desc
id="desc837">Drawing</desc>
<text
id="text847"
style="font-size:318px;font-weight:400;fill:#008000;font-family:Helvetica embedded">
<tspan
x="4885 4991 5169 5330 5507 5579 5740 5918 6096 6324 6502 6591 6769 6997 7175 7353 7425 7586 7692"
y="14607"
id="tspan849">receive_DataReply()</tspan>
</text>
</g>
<text
id="text863"
style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="8000 8305 8398 8610 8821 8914 9151 9363 9575 9693 9833 10070 10307 10544 10663 10781 11018 11255 11493 11632 11869 12106"
y="10878"
id="tspan865">Diskless read, 512-32K</tspan>
</text>
<text
id="text879"
style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="5029 5258 5435 5596 5774 5952 6130 6307 6413 6591 6769 6947 7125 7230 7408 7586 7692"
y="12004"
id="tspan881">w_send_read_req()</tspan>
</text>
<text
id="text895"
style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="6961 7266 7571 7854 8159 8278 8515 8633 8870 9107 9226 9463 9581 9700 9793 10030"
y="2806"
id="tspan897">DRBD 8 data flow</tspan>
</text>
<path
d="M 3900,5300 L 3700,5300 L 3700,7000 L 3900,7000"
id="path907"
style="fill:none;stroke:#000000;visibility:visible" />
<path
d="M 3900,17600 L 3700,17600 L 3700,22000 L 3900,22000"
id="path919"
style="fill:none;stroke:#000000;visibility:visible" />
<path
d="M 16100,20000 L 16300,20000 L 16300,18500 L 16100,18500"
id="path931"
style="fill:none;stroke:#000000;visibility:visible" />
<text
id="text947"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="2126 2304 2376 2554 2731 2909 3087 3159 3337 3515 3587 3764 3870"
y="5202"
id="tspan949">al_begin_io()</tspan>
</text>
<text
id="text963"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="1632 1810 1882 2060 2220 2398 2661 2839 2910 3088 3177 3355 3533 3605 3783 3888"
y="7331"
id="tspan965">al_complete_io()</tspan>
</text>
<text
id="text979"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="2126 2232 2393 2571 2748 2926 3104 3176 3354 3531 3603 3781 3887"
y="17431"
id="tspan981">rs_begin_io()</tspan>
</text>
<text
id="text995"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="1626 1732 1893 2071 2231 2409 2672 2849 2921 3099 3188 3366 3544 3616 3793 3899"
y="22331"
id="tspan997">rs_complete_io()</tspan>
</text>
<text
id="text1011"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="16027 16133 16294 16472 16649 16827 17005 17077 17255 17432 17504 17682 17788"
y="18402"
id="tspan1013">rs_begin_io()</tspan>
</text>
<text
id="text1027"
style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
<tspan
x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
y="20331"
id="tspan1029">rs_complete_io()</tspan>
</text>
</svg>


View File

@@ -0,0 +1,18 @@
digraph conn_states {
StandAlone -> WFConnection [ label = "ioctl_set_net()" ]
WFConnection -> Unconnected [ label = "unable to bind()" ]
WFConnection -> WFReportParams [ label = "in connect() after accept" ]
WFReportParams -> StandAlone [ label = "checks in receive_param()" ]
WFReportParams -> Connected [ label = "in receive_param()" ]
WFReportParams -> WFBitMapS [ label = "sync_handshake()" ]
WFReportParams -> WFBitMapT [ label = "sync_handshake()" ]
WFBitMapS -> SyncSource [ label = "receive_bitmap()" ]
WFBitMapT -> SyncTarget [ label = "receive_bitmap()" ]
SyncSource -> Connected
SyncTarget -> Connected
SyncSource -> PausedSyncS
SyncTarget -> PausedSyncT
PausedSyncS -> SyncSource
PausedSyncT -> SyncTarget
Connected -> WFConnection [ label = "* on network error" ]
}

View File

@@ -0,0 +1,42 @@
================================
kernel data structure for DRBD-9
================================
This describes the in-kernel data structure for DRBD-9. Starting with
Linux v3.14, we are reorganizing DRBD to use this data structure.
Basic Data Structure
====================
A node has a number of DRBD resources. Each such resource has a number of
devices (aka volumes) and connections to other nodes ("peer nodes"). Each DRBD
device is represented by a block device locally.
The DRBD objects are interconnected to form a matrix as depicted below; a
drbd_peer_device object sits at each intersection between a drbd_device and a
drbd_connection::
/--------------+---------------+.....+---------------\
| resource | device | | device |
+--------------+---------------+.....+---------------+
| connection | peer_device | | peer_device |
+--------------+---------------+.....+---------------+
: : : : :
: : : : :
+--------------+---------------+.....+---------------+
| connection | peer_device | | peer_device |
\--------------+---------------+.....+---------------/
In this table, horizontally, devices can be accessed from resources by their
volume number. Likewise, peer_devices can be accessed from connections by
their volume number. Objects in the vertical direction are connected by doubly
linked lists. There are back pointers from peer_devices to their connections and
devices, and from connections and devices to their resource.
All resources are in the drbd_resources doubly linked list. In addition, all
devices can be accessed by their minor device number via the drbd_devices idr.
The drbd_resource, drbd_connection, and drbd_device objects are reference
counted. The peer_device objects only serve to establish the links between
devices and connections; their lifetime is determined by the lifetime of the
device and connection which they reference.

View File

@@ -0,0 +1,16 @@
digraph disk_states {
Diskless -> Inconsistent [ label = "ioctl_set_disk()" ]
Diskless -> Consistent [ label = "ioctl_set_disk()" ]
Diskless -> Outdated [ label = "ioctl_set_disk()" ]
Consistent -> Outdated [ label = "receive_param()" ]
Consistent -> UpToDate [ label = "receive_param()" ]
Consistent -> Inconsistent [ label = "start resync" ]
Outdated -> Inconsistent [ label = "start resync" ]
UpToDate -> Inconsistent [ label = "ioctl_replicate" ]
Inconsistent -> UpToDate [ label = "resync completed" ]
Consistent -> Failed [ label = "io completion error" ]
Outdated -> Failed [ label = "io completion error" ]
UpToDate -> Failed [ label = "io completion error" ]
Inconsistent -> Failed [ label = "io completion error" ]
Failed -> Diskless [ label = "sending notify to peer" ]
}

View File

@@ -0,0 +1,85 @@
// vim: set sw=2 sts=2 :
digraph {
rankdir=BT
bgcolor=white
node [shape=plaintext]
node [fontcolor=black]
StandAlone [ style=filled,fillcolor=gray,label=StandAlone ]
node [fontcolor=lightgray]
Unconnected [ label=Unconnected ]
CommTrouble [ shape=record,
label="{communication loss|{Timeout|BrokenPipe|NetworkFailure}}" ]
node [fontcolor=gray]
subgraph cluster_try_connect {
label="try to connect, handshake"
rank=max
WFConnection [ label=WFConnection ]
WFReportParams [ label=WFReportParams ]
}
TearDown [ label=TearDown ]
Connected [ label=Connected,style=filled,fillcolor=green,fontcolor=black ]
node [fontcolor=lightblue]
StartingSyncS [ label=StartingSyncS ]
StartingSyncT [ label=StartingSyncT ]
subgraph cluster_bitmap_exchange {
node [fontcolor=red]
fontcolor=red
label="new application (WRITE?) requests blocked\lwhile bitmap is exchanged"
WFBitMapT [ label=WFBitMapT ]
WFSyncUUID [ label=WFSyncUUID ]
WFBitMapS [ label=WFBitMapS ]
}
node [fontcolor=blue]
cluster_resync [ shape=record,label="{<any>resynchronisation process running\l'concurrent' application requests allowed|{{<T>PausedSyncT\nSyncTarget}|{<S>PausedSyncS\nSyncSource}}}" ]
node [shape=box,fontcolor=black]
// drbdadm [label="drbdadm connect"]
// handshake [label="drbd_connect()\ndrbd_do_handshake\ndrbd_sync_handshake() etc."]
// comm_error [label="communication trouble"]
//
// edges
// --------------------------------------
StandAlone -> Unconnected [ label="drbdadm connect" ]
Unconnected -> StandAlone [ label="drbdadm disconnect\lor serious communication trouble" ]
Unconnected -> WFConnection [ label="receiver thread is started" ]
WFConnection -> WFReportParams [ headlabel="accept()\land/or \lconnect()\l" ]
WFReportParams -> StandAlone [ label="during handshake\lpeers do not agree\labout something essential" ]
WFReportParams -> Connected [ label="data identical\lno sync needed",color=green,fontcolor=green ]
WFReportParams -> WFBitMapS
WFReportParams -> WFBitMapT
WFBitMapT -> WFSyncUUID [minlen=0.1,constraint=false]
WFBitMapS -> cluster_resync:S
WFSyncUUID -> cluster_resync:T
edge [color=green]
cluster_resync:any -> Connected [ label="resync done",fontcolor=green ]
edge [color=red]
WFReportParams -> CommTrouble
Connected -> CommTrouble
cluster_resync:any -> CommTrouble
edge [color=black]
CommTrouble -> Unconnected [label="receiver thread is stopped" ]
}

View File

@@ -0,0 +1,30 @@
.. SPDX-License-Identifier: GPL-2.0
.. The files included here are intended to help understand the implementation.
Data flows relating some functions and write packets
=====================================================
.. kernel-figure:: DRBD-8.3-data-packets.svg
:alt: DRBD-8.3-data-packets.svg
:align: center
.. kernel-figure:: DRBD-data-packets.svg
:alt: DRBD-data-packets.svg
:align: center
Sub graphs of DRBD's state transitions
======================================
.. kernel-figure:: conn-states-8.dot
:alt: conn-states-8.dot
:align: center
.. kernel-figure:: disk-states-8.dot
:alt: disk-states-8.dot
:align: center
.. kernel-figure:: node-states-8.dot
:alt: node-states-8.dot
:align: center

View File

@@ -0,0 +1,19 @@
==========================================
Distributed Replicated Block Device - DRBD
==========================================
Description
===========
DRBD is a shared-nothing, synchronously replicated block device. It
is designed to serve as a building block for high availability
clusters and, in this context, is a "drop-in" replacement for shared
storage. Simplistically, you could see it as a network RAID 1.
Please visit http://www.drbd.org to find out more.
.. toctree::
:maxdepth: 1
data-structure-v9
figures

View File

@@ -0,0 +1,13 @@
digraph node_states {
Secondary -> Primary [ label = "ioctl_set_state()" ]
Primary -> Secondary [ label = "ioctl_set_state()" ]
}
digraph peer_states {
Secondary -> Primary [ label = "recv state packet" ]
Primary -> Secondary [ label = "recv state packet" ]
Primary -> Unknown [ label = "connection lost" ]
Secondary -> Unknown [ label = "connection lost" ]
Unknown -> Primary [ label = "connected" ]
Unknown -> Secondary [ label = "connected" ]
}

View File

@@ -0,0 +1,255 @@
=============
Floppy Driver
=============
FAQ list:
=========
A FAQ list may be found in the fdutils package (see below), and also
at <http://fdutils.linux.lu/faq.html>.
LILO configuration options (Thinkpad users, read this)
======================================================
The floppy driver is configured using the 'floppy=' option in
lilo. This option can be typed at the boot prompt, or entered in the
lilo configuration file.
Example: If your kernel is called linux-2.6.9, type the following line
at the lilo boot prompt (if you have a thinkpad)::
linux-2.6.9 floppy=thinkpad
You may also enter the following line in /etc/lilo.conf, in the description
of linux-2.6.9::
append = "floppy=thinkpad"
Several floppy related options may be given, example::
linux-2.6.9 floppy=daring floppy=two_fdc
append = "floppy=daring floppy=two_fdc"
If you give options both in the lilo config file and on the boot
prompt, the option strings of both places are concatenated, the boot
prompt options coming last. That's why there are also options to
restore the default behavior.
Module configuration options
============================
If you use the floppy driver as a module, use the following syntax::
modprobe floppy floppy="<options>"
Example::
modprobe floppy floppy="omnibook messages"
If you need certain options enabled every time you load the floppy driver,
you can put::
options floppy floppy="omnibook messages"
in a configuration file in /etc/modprobe.d/.
The floppy driver related options are:
floppy=asus_pci
Sets the bit mask to allow only units 0 and 1. (default)
floppy=daring
Tells the floppy driver that you have a well behaved floppy controller.
This allows more efficient and smoother operation, but may fail on
certain controllers. This may speed up certain operations.
floppy=0,daring
Tells the floppy driver that your floppy controller should be used
with caution.
floppy=one_fdc
Tells the floppy driver that you have only one floppy controller.
(default)
floppy=two_fdc / floppy=<address>,two_fdc
Tells the floppy driver that you have two floppy controllers.
The second floppy controller is assumed to be at <address>.
This option is not needed if the second controller is at address
0x370, and if you use the 'cmos' option.
floppy=thinkpad
Tells the floppy driver that you have a Thinkpad. Thinkpads use an
inverted convention for the disk change line.
floppy=0,thinkpad
Tells the floppy driver that you don't have a Thinkpad.
floppy=omnibook / floppy=nodma
Tells the floppy driver not to use Dma for data transfers.
This is needed on HP Omnibooks, which don't have a workable
DMA channel for the floppy driver. This option is also useful
if you frequently get "Unable to allocate DMA memory" messages.
Indeed, DMA memory needs to be contiguous in physical memory,
and is thus harder to find, whereas non-dma buffers may be
allocated in virtual memory. However, I advise against this if
you have an FDC without a FIFO (8272A or 82072). 82072A and
later are OK. You also need at least a 486 to use nodma.
If you use nodma mode, I suggest you also set the FIFO
threshold to 10 or lower, in order to limit the number of data
transfer interrupts.
If you have a FIFO-able FDC, the floppy driver automatically
falls back on non DMA mode if no DMA-able memory can be found.
If you want to avoid this, explicitly ask for 'yesdma'.
floppy=yesdma
Tells the floppy driver that a workable DMA channel is available.
(default)
floppy=nofifo
Disables the FIFO entirely. This is needed if you get "Bus
master arbitration error" messages from your Ethernet card (or
from other devices) while accessing the floppy.
floppy=usefifo
Enables the FIFO. (default)
floppy=<threshold>,fifo_depth
Sets the FIFO threshold. This is mostly relevant in DMA
mode. If this is higher, the floppy driver tolerates more
interrupt latency, but it triggers more interrupts (i.e. it
imposes more load on the rest of the system). If this is
lower, the interrupt latency should be lower too (faster
processor). The benefit of a lower threshold is fewer
interrupts.
To tune the fifo threshold, switch on over/underrun messages
using 'floppycontrol --messages'. Then access a floppy
disk. If you get a huge amount of "Over/Underrun - retrying"
messages, then the fifo threshold is too low. Try with a
higher value, until you only get an occasional Over/Underrun.
It is a good idea to compile the floppy driver as a module
when doing this tuning. Indeed, it allows you to try different
fifo values without rebooting the machine for each test. Note
that you need to do 'floppycontrol --messages' every time you
re-insert the module.
Usually, tuning the fifo threshold should not be needed, as
the default (0xa) is reasonable.
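As a rough sketch of that workflow (the threshold values and the dd read
used to exercise the drive are illustrative, not recommendations)::
for t in 8 10 12 14; do
rmmod floppy
modprobe floppy floppy="$t,fifo_depth"
floppycontrol --messages # re-enable messages after each reload
dd if=/dev/fd0 of=/dev/null bs=512 count=2880
done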
floppy=<drive>,<type>,cmos
Sets the CMOS type of <drive> to <type>. This is mandatory if
you have more than two floppy drives (only two can be
described in the physical CMOS), or if your BIOS uses
non-standard CMOS types. The CMOS types are:
== ==================================
0 Use the value of the physical CMOS
1 5 1/4 DD
2 5 1/4 HD
3 3 1/2 DD
4 3 1/2 HD
5 3 1/2 ED
6 3 1/2 ED
16 unknown or not installed
== ==================================
(Note: there are two valid types for ED drives. This is because 5 was
initially chosen to represent floppy *tapes*, and 6 for ED drives.
AMI ignored this, and used 5 for ED drives. That's why the floppy
driver handles both.)
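For example, on a hypothetical box whose BIOS cannot describe a second
drive, one might declare it as a 3 1/2 ED unit like this::
floppy=1,5,cmos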
floppy=unexpected_interrupts
Print a warning message when an unexpected interrupt is received.
(default)
floppy=no_unexpected_interrupts / floppy=L40SX
Don't print a message when an unexpected interrupt is received. This
is needed on IBM L40SX laptops in certain video modes. (There seems
to be an interaction between video and floppy. The unexpected
interrupts affect only performance, and can be safely ignored.)
floppy=broken_dcl
Don't use the disk change line, but assume that the disk was
changed whenever the device node is reopened. Needed on some
boxes where the disk change line is broken or unsupported.
This should be regarded as a stopgap measure; indeed, it makes
floppy operation less efficient due to unneeded cache
flushes, and slightly more unreliable. Please verify your
cable, connection and jumper settings if you have any DCL
problems. However, some older drives, and also some laptops
are known not to have a DCL.
floppy=debug
Print debugging messages.
floppy=messages
Print informational messages for some operations (disk change
notifications, warnings about over and underruns, and about
autodetection).
floppy=silent_dcl_clear
Uses a less noisy way to clear the disk change line (which
doesn't involve seeks). Implied by 'daring' option.
floppy=<nr>,irq
Sets the floppy IRQ to <nr> instead of 6.
floppy=<nr>,dma
Sets the floppy DMA channel to <nr> instead of 2.
floppy=slow
Use the PS/2 stepping rate:
PS/2 floppies have much slower step rates than regular floppies.
It has been recommended to use about 1/4 of the default speed
in some more extreme cases.
Supporting utilities and additional documentation:
==================================================
Additional parameters of the floppy driver can be configured at
runtime. Utilities which do this can be found in the fdutils package.
This package also contains a new version of mtools which allows access
to high capacity disks (up to 1992K on a high density 3 1/2 disk!).
It also contains additional documentation about the floppy driver.
The latest version can be found at fdutils homepage:
http://fdutils.linux.lu
The fdutils releases can be found at:
http://fdutils.linux.lu/download.html
http://www.tux.org/pub/knaff/fdutils/
ftp://metalab.unc.edu/pub/Linux/utils/disk-management/
Reporting problems about the floppy driver
==========================================
If you have a question or a bug report about the floppy driver, mail
me at Alain.Knaff@poboxes.com. If you post to Usenet, preferably use
comp.os.linux.hardware. As the volume in these groups is rather high,
be sure to include the word "floppy" (or "FLOPPY") in the subject
line. If the reported problem happens when mounting floppy disks, be
sure to mention also the type of the filesystem in the subject line.
Be sure to read the FAQ before mailing/posting any bug reports!
Alain
Changelog
=========
10-30-2004 :
Cleanup, updating, add reference to module configuration.
James Nelson <james4765@gmail.com>
6-3-2000 :
Original Document

View File

@@ -0,0 +1,16 @@
.. SPDX-License-Identifier: GPL-2.0
=============
Block Devices
=============
.. toctree::
:maxdepth: 1
floppy
nbd
paride
ramdisk
zram
drbd/index

View File

@@ -0,0 +1,31 @@
==================================
Network Block Device (TCP version)
==================================
1) Overview
-----------
What is it: With this compiled in the kernel (or as a module), Linux
can use a remote server as one of its block devices. So every time
the client computer wants to read, e.g., /dev/nb0, it sends a
request over TCP to the server, which will reply with the data read.
This can be used for stations with low disk space (or even diskless)
to borrow disk space from another computer.
Unlike NFS, it is possible to put any filesystem on it, etc.
For more information, or to download the nbd-client and nbd-server
tools, go to http://nbd.sf.net/.
The nbd kernel module need only be installed on the client
system, as the nbd-server is completely in userspace. In fact,
the nbd-server has been successfully ported to other operating
systems, including Windows.
A) NBD parameters
-----------------
max_part
Number of partitions per device (default: 0).
nbds_max
Number of block devices that should be initialized (default: 16).
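A minimal end-to-end sketch (the export path, host name and port number
are made up for illustration; check the nbd-server and nbd-client man
pages for the exact syntax of your version)::
# on the server (pure userspace):
nbd-server 2000 /export/disk.img
# on the client:
modprobe nbd max_part=8
nbd-client server.example.com 2000 /dev/nbd0
mount /dev/nbd0 /mnt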

View File

@@ -0,0 +1,439 @@
===================================
Linux and parallel port IDE devices
===================================
PARIDE v1.03 (c) 1997-8 Grant Guenther <grant@torque.net>
1. Introduction
===============
Owing to the simplicity and near universality of the parallel port interface
to personal computers, many external devices such as portable hard-disk,
CD-ROM, LS-120 and tape drives use the parallel port to connect to their
host computer. While some devices (notably scanners) use ad-hoc methods
to pass commands and data through the parallel port interface, most
external devices are actually identical to an internal model, but with
a parallel-port adapter chip added in. Some of the original parallel port
adapters were little more than mechanisms for multiplexing a SCSI bus.
(The Iomega PPA-3 adapter used in the ZIP drives is an example of this
approach). Most current designs, however, take a different approach.
The adapter chip reproduces a small ISA or IDE bus in the external device
and the communication protocol provides operations for reading and writing
device registers, as well as data block transfer functions. Sometimes,
the device being addressed via the parallel cable is a standard SCSI
controller like an NCR 5380. The "ditto" family of external tape
drives use the ISA replicator to interface a floppy disk controller,
which is then connected to a floppy-tape mechanism. The vast majority
of external parallel port devices, however, are now based on standard
IDE type devices, which require no intermediate controller. If one
were to open up a parallel port CD-ROM drive, for instance, one would
find a standard ATAPI CD-ROM drive, a power supply, and a single adapter
that interconnected a standard PC parallel port cable and a standard
IDE cable. It is usually possible to exchange the CD-ROM device with
any other device using the IDE interface.
The document describes the support in Linux for parallel port IDE
devices. It does not cover parallel port SCSI devices, "ditto" tape
drives or scanners. Many different devices are supported by the
parallel port IDE subsystem, including:
- MicroSolutions backpack CD-ROM
- MicroSolutions backpack PD/CD
- MicroSolutions backpack hard-drives
- MicroSolutions backpack 8000t tape drive
- SyQuest EZ-135, EZ-230 & SparQ drives
- Avatar Shark
- Imation Superdisk LS-120
- Maxell Superdisk LS-120
- FreeCom Power CD
- Hewlett-Packard 5GB and 8GB tape drives
- Hewlett-Packard 7100 and 7200 CD-RW drives
as well as most of the clone and no-name products on the market.
To support such a wide range of devices, PARIDE, the parallel port IDE
subsystem, is actually structured in three parts. There is a base
paride module which provides a registry and some common methods for
accessing the parallel ports. The second component is a set of
high-level drivers for each of the different types of supported devices:
=== =============
pd IDE disk
pcd ATAPI CD-ROM
pf ATAPI disk
pt ATAPI tape
pg ATAPI generic
=== =============
(Currently, the pg driver is only used with CD-R drives).
The high-level drivers function according to the relevant standards.
The third component of PARIDE is a set of low-level protocol drivers
for each of the parallel port IDE adapter chips. Thanks to the interest
and encouragement of Linux users from many parts of the world,
support is available for almost all known adapter protocols:
==== ====================================== ====
aten ATEN EH-100 (HK)
bpck Microsolutions backpack (US)
comm DataStor (old-type) "commuter" adapter (TW)
dstr DataStor EP-2000 (TW)
epat Shuttle EPAT (UK)
epia Shuttle EPIA (UK)
fit2 FIT TD-2000 (US)
fit3 FIT TD-3000 (US)
friq Freecom IQ cable (DE)
frpw Freecom Power (DE)
kbic KingByte KBIC-951A and KBIC-971A (TW)
ktti KT Technology PHd adapter (SG)
on20 OnSpec 90c20 (US)
on26 OnSpec 90c26 (US)
==== ====================================== ====
2. Using the PARIDE subsystem
=============================
While configuring the Linux kernel, you may choose either to build
the PARIDE drivers into your kernel, or to build them as modules.
In either case, you will need to select "Parallel port IDE device support"
as well as at least one of the high-level drivers and at least one
of the parallel port communication protocols. If you do not know
what kind of parallel port adapter is used in your drive, you could
begin by checking the file names and any text files on your DOS
installation floppy. Alternatively, you can look at the markings on
the adapter chip itself. That's usually sufficient to identify the
correct device.
You can actually select all the protocol modules, and allow the PARIDE
subsystem to try them all for you.
For the "brand-name" products listed above, here are the protocol
and high-level drivers that you would use:
================ ============ ====== ========
Manufacturer Model Driver Protocol
================ ============ ====== ========
MicroSolutions CD-ROM pcd bpck
MicroSolutions PD drive pf bpck
MicroSolutions hard-drive pd bpck
MicroSolutions 8000t tape pt bpck
SyQuest EZ, SparQ pd epat
Imation Superdisk pf epat
Maxell Superdisk pf friq
Avatar Shark pd epat
FreeCom CD-ROM pcd frpw
Hewlett-Packard 5GB Tape pt epat
Hewlett-Packard 7200e (CD) pcd epat
Hewlett-Packard 7200e (CD-R) pg epat
================ ============ ====== ========
2.1 Configuring built-in drivers
---------------------------------
We recommend that you get to know how the drivers work and how to
configure them as loadable modules, before attempting to compile a
kernel with the drivers built-in.
If you built all of your PARIDE support directly into your kernel,
and you have just a single parallel port IDE device, your kernel should
locate it automatically for you. If you have more than one device,
you may need to give some command line options to your bootloader
(e.g. LILO); how to do that is beyond the scope of this document.
The high-level drivers accept a number of command line parameters, all
of which are documented in the source files in linux/drivers/block/paride.
By default, each driver will automatically try all parallel ports it
can find, and all protocol types that have been installed, until it finds
a parallel port IDE adapter. Once it finds one, the probe stops. So,
if you have more than one device, you will need to tell the drivers
how to identify them. This requires specifying the port address, the
protocol identification number and, for some devices, the drive's
chain ID. While your system is booting, a number of messages are
displayed on the console. Like all such messages, they can be
reviewed with the 'dmesg' command. Among those messages will be
some lines like::
paride: bpck registered as protocol 0
paride: epat registered as protocol 1
The numbers will always be the same until you build a new kernel with
different protocol selections. You should note these numbers as you
will need them to identify the devices.
If you happen to be using a MicroSolutions backpack device, you will
also need to know the unit ID number for each drive. This is usually
the last two digits of the drive's serial number (but read MicroSolutions'
documentation about this).
As an example, let's assume that you have a MicroSolutions PD/CD drive
with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
EZ-135 connected to the chained port on the PD/CD drive and also an
Imation Superdisk connected to port 0x278. You could give the following
options on your boot command::
pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36
In the last option, pf.drive1 configures device /dev/pf1, the 0x378
is the parallel port base address, the 0 is the protocol registration
number and 36 is the chain ID.
Please note: while PARIDE will work both with and without the
PARPORT parallel port sharing system that is included by the
"Parallel port support" option, PARPORT must be included and enabled
if you want to use chains of devices on the same parallel port.
2.2 Loading and configuring PARIDE as modules
----------------------------------------------
It is much faster and simpler to get to understand the PARIDE drivers
if you use them as loadable kernel modules.
Note 1:
using these drivers with the "kerneld" automatic module loading
system is not recommended for beginners, and is not documented here.
Note 2:
if you build PARPORT support as a loadable module, PARIDE must
also be built as loadable modules, and PARPORT must be loaded before
the PARIDE modules.
To use PARIDE, you must begin by::
insmod paride
this loads a base module which provides a registry for the protocols,
among other tasks.
Then, load as many of the protocol modules as you think you might need.
As you load each module, it will register the protocols that it supports,
and print a log message to your kernel log file and your console. For
example::
# insmod epat
paride: epat registered as protocol 0
# insmod kbic
paride: k951 registered as protocol 1
paride: k971 registered as protocol 2
Finally, you can load high-level drivers for each kind of device that
you have connected. By default, each driver will autoprobe for a single
device, but you can support up to four similar devices by giving their
individual co-ordinates when you load the driver.
For example, if you had two no-name CD-ROM drives both using the
KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc
you could give the following command::
# insmod pcd drive0=0x378,1 drive1=0x3bc,1
For most adapters, giving a port address and protocol number is sufficient,
but check the source files in linux/drivers/block/paride for more
information. (Hopefully someone will write some man pages one day!).
As another example, here's what happens when PARPORT is installed, and
a SyQuest EZ-135 is attached to port 0x378::
# insmod paride
paride: version 1.0 installed
# insmod epat
paride: epat registered as protocol 0
# insmod pd
pd: pd version 1.0, major 45, cluster 64, nice 0
pda: Sharing parport1 at 0x378
pda: epat 1.0, Shuttle EPAT chip c3 at 0x378, mode 5 (EPP-32), delay 1
pda: SyQuest EZ135A, 262144 blocks [128M], (512/16/32), removable media
pda: pda1
Note that the last line is the output from the generic partition table
scanner - in this case it reports that it has found a disk with one partition.
2.3 Using a PARIDE device
--------------------------
Once the drivers have been loaded, you can access PARIDE devices in the
same way as their traditional counterparts. You will probably need to
create the device "special files". Here is a simple script that you can
cut to a file and execute::
#!/bin/bash
#
# mkd -- a script to create the device special files for the PARIDE subsystem
#
function mkdev {
mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
}
#
function pd {
D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) )
mkdev pd$D b 45 $[ $1 * 16 ]
for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
done
}
#
cd /dev
#
for u in 0 1 2 3 ; do pd $u ; done
for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
#
# end of mkd
With the device files and drivers in place, you can access PARIDE devices
like any other Linux device. For example, to mount a CD-ROM in pcd0, use::
mount /dev/pcd0 /cdrom
If you have a fresh Avatar Shark cartridge, and the drive is pda, you
might do something like::
fdisk /dev/pda -- make a new partition table with
partition 1 of type 83
mke2fs /dev/pda1 -- to build the file system
mkdir /shark -- make a place to mount the disk
mount /dev/pda1 /shark
Devices like the Imation superdisk work in the same way, except that
they do not have a partition table. For example to make a 120MB
floppy that you could share with a DOS system::
mkdosfs /dev/pf0
mount /dev/pf0 /mnt
2.4 The pf driver
------------------
The pf driver is intended for use with parallel port ATAPI disk
devices. The most common devices in this category are PD drives
and LS-120 drives. Traditionally, media for these devices are not
partitioned. Consequently, the pf driver does not support partitioned
media. This may be changed in a future version of the driver.
2.5 Using the pt driver
------------------------
The pt driver for parallel port ATAPI tape drives is a minimal driver.
It does not yet support many of the standard tape ioctl operations.
For best performance, a block size of 32KB should be used. You will
probably want to set the parallel port delay to 0, if you can.
2.6 Using the pg driver
------------------------
The pg driver can be used in conjunction with the cdrecord program
to create CD-ROMs. Please get cdrecord version 1.6.1 or later
from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
your parallel port should ideally be set to EPP mode, and the "port delay"
should be set to 0. With those settings it is possible to record at 2x
speed without any buffer underruns. If you cannot get the driver to work
in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.
3. Troubleshooting
==================
3.1 Use EPP mode if you can
----------------------------
The most common problems that people report with the PARIDE drivers
concern the parallel port CMOS settings. At this time, none of the
PARIDE protocol modules support ECP mode, or any ECP combination modes.
If you are able to do so, please set your parallel port into EPP mode
using your CMOS setup procedure.
3.2 Check the port delay
-------------------------
Some parallel ports cannot reliably transfer data at full speed. To
offset the errors, the PARIDE protocol modules introduce a "port
delay" between each access to the i/o ports. Each protocol sets
a default value for this delay. In most cases, the user can override
the default and set it to 0 - resulting in somewhat higher transfer
rates. In some rare cases (especially with older 486 systems) the
default delays are not long enough. If you experience corrupt data
transfers, or unexpected failures, you may wish to increase the
port delay. The delay can be programmed using the "driveN" parameters
to each of the high-level drivers. Please see the notes above, or
read the comments at the beginning of the driver source files in
linux/drivers/block/paride.
3.3 Some drives need a printer reset
-------------------------------------
There appear to be a number of "noname" external drives on the market
that do not always power up correctly. We have noticed this with some
drives based on OnSpec and older Freecom adapters. In these rare cases,
the adapter can often be reinitialised by issuing a "printer reset" on
the parallel port. As the reset operation is potentially disruptive in
multiple device environments, the PARIDE drivers will not do it
automatically. You can however, force a printer reset by doing::
insmod lp reset=1
rmmod lp
If you have one of these marginal cases, you should probably build
your paride drivers as modules, and arrange to do the printer reset
before loading the PARIDE drivers.
3.4 Use the verbose option and dmesg if you need help
------------------------------------------------------
While a lot of testing has gone into these drivers to make them work
as smoothly as possible, problems will arise. If you do have problems,
please check all the obvious things first: does the drive work in
DOS with the manufacturer's drivers? If that doesn't yield any useful
clues, then please make sure that only one drive is hooked to your system,
and that either (a) PARPORT is enabled or (b) no other device driver
is using your parallel port (check in /proc/ioports). Then, load the
appropriate drivers (you can load several protocol modules if you want)
as in::
# insmod paride
# insmod epat
# insmod bpck
# insmod kbic
...
# insmod pd verbose=1
(using the correct driver for the type of device you have, of course).
The verbose=1 parameter will cause the drivers to log a trace of their
activity as they attempt to locate your drive.
Use 'dmesg' to capture a log of all the PARIDE messages (any messages
beginning with paride:, a protocol module's name or a driver's name) and
include that with your bug report. You can submit a bug report in one
of two ways. Either send it directly to the author of the PARIDE suite,
by e-mail to grant@torque.net, or join the linux-parport mailing list
and post your report there.
3.5 For more information or help
---------------------------------
You can join the linux-parport mailing list by sending a mail message
to:
linux-parport-request@torque.net
with the single word::
subscribe
in the body of the mail message (not in the subject line). Please be
sure that your mail program is correctly set up when you do this, as
the list manager is a robot that will subscribe you using the reply
address in your mail headers. REMOVE any anti-spam gimmicks you may
have in your mail headers, when sending mail to the list server.
You might also find some useful information on the linux-parport
web pages (although they are not always up to date) at
http://web.archive.org/web/%2E/http://www.torque.net/parport/

View File

@@ -0,0 +1,177 @@
==========================================
Using the RAM disk block device with Linux
==========================================
.. Contents:
1) Overview
2) Kernel Command Line Parameters
3) Using "rdev -r"
4) An Example of Creating a Compressed RAM Disk
1) Overview
-----------
The RAM disk driver is a way to use main system memory as a block device. It
is required for initrd, an initial filesystem used if you need to load modules
in order to access the root filesystem (see Documentation/admin-guide/initrd.rst). It can
also be used for a temporary filesystem for crypto work, since the contents
are erased on reboot.
The RAM disk dynamically grows as more space is required. It does this by using
RAM from the buffer cache. The driver marks the buffers it is using as dirty
so that the VM subsystem does not try to reclaim them later.
The RAM disk supports up to 16 RAM disks by default, and can be reconfigured
to support an unlimited number of RAM disks (at your own risk). Just change
the configuration symbol BLK_DEV_RAM_COUNT in the Block drivers config menu
and (re)build the kernel.
To use RAM disk support with your system, run './MAKEDEV ram' from the /dev
directory. RAM disks are all major number 1, and start with minor number 0
for /dev/ram0, etc. If used, modern kernels use /dev/ram0 for an initrd.
The new RAM disk also has the ability to load compressed RAM disk images,
allowing one to squeeze more programs onto an average installation or
rescue floppy disk.
2) Parameters
---------------------------------
2a) Kernel Command Line Parameters
ramdisk_size=N
Size of the ramdisk.
This parameter tells the RAM disk driver to set up RAM disks of N k size. The
default is 4096 (4 MB).
2b) Module parameters
rd_nr
Number of /dev/ramX devices created.
max_part
Maximum partition number.
rd_size
See ramdisk_size.
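For example, to get two 64 MB RAM disks with partition support when the
driver is built as a module (a sketch; this assumes the brd module
provides the RAM disk, as on current kernels)::
modprobe brd rd_nr=2 rd_size=65536 max_part=4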
3) Using "rdev -r"
------------------
The usage of the word (two bytes) that "rdev -r" sets in the kernel image is
as follows. The low 11 bits (0 -> 10) specify an offset (in 1 k blocks) of up
to 2 MB (2^11) of where to find the RAM disk (this used to be the size). Bit
14 indicates that a RAM disk is to be loaded, and bit 15 indicates whether a
prompt/wait sequence is to be given before trying to read the RAM disk. Since
the RAM disk dynamically grows as data is being written into it, a size field
is not required. Bits 11 to 13 are not currently used and may as well be zero.
These numbers are no magical secrets, as seen below::
./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000
./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000
Consider a typical two floppy disk setup, where you will have the
kernel on disk one, and have already put a RAM disk image onto disk #2.
Hence you want to set bits 0 to 13 as 0, meaning that your RAM disk
starts at an offset of 0 kB from the beginning of the floppy.
The command line equivalent is: "ramdisk_start=0"
You want bit 14 as one, indicating that a RAM disk is to be loaded.
The command line equivalent is: "load_ramdisk=1"
You want bit 15 as one, indicating that you want a prompt/keypress
sequence so that you have a chance to switch floppy disks.
The command line equivalent is: "prompt_ramdisk=1"
Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word.
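The same arithmetic in shell, as a quick check::
prompt=1; load=1; start=0
echo $(( (prompt << 15) | (load << 14) | start )) # prints 49152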
So to create disk one of the set, you would do::
/usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0
/usr/src/linux# rdev /dev/fd0 /dev/fd0
/usr/src/linux# rdev -r /dev/fd0 49152
If you make a boot disk that has LILO, then for the above, you would use::
append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1"
Since the default start = 0 and the default prompt = 1, you could use::
append = "load_ramdisk=1"
4) An Example of Creating a Compressed RAM Disk
-----------------------------------------------
To create a RAM disk image, you will need a spare block device to
construct it on. This can be the RAM disk device itself, or an
unused disk partition (such as an unmounted swap partition). For this
example, we will use the RAM disk device, "/dev/ram0".
Note: This technique should not be done on a machine with less than 8 MB
of RAM. If using a spare disk partition instead of /dev/ram0, then this
restriction does not apply.
a) Decide on the RAM disk size that you want. Say 2 MB for this example.
Create it by writing to the RAM disk device. (This step is not currently
required, but may be in the future.) It is wise to zero out the
area (esp. for disks) so that maximal compression is achieved for
the unused blocks of the image that you are about to create::
dd if=/dev/zero of=/dev/ram0 bs=1k count=2048
b) Make a filesystem on it. Say ext2fs for this example::
mke2fs -vm0 /dev/ram0 2048
c) Mount it, copy the files you want to it (eg: /etc/* /dev/* ...)
and unmount it again.
d) Compress the contents of the RAM disk. The level of compression
will be approximately 50% of the space used by the files. Unused
space on the RAM disk will compress to almost nothing::
dd if=/dev/ram0 bs=1k count=2048 | gzip -v9 > /tmp/ram_image.gz
e) Put the kernel onto the floppy::
dd if=zImage of=/dev/fd0 bs=1k
f) Put the RAM disk image onto the floppy, after the kernel. Use an offset
that is slightly larger than the kernel, so that you can put another
(possibly larger) kernel onto the same floppy later without overlapping
the RAM disk image. An offset of 400 kB for kernels about 350 kB in
size would be reasonable. Make sure offset+size of ram_image.gz is
not larger than the total space on your floppy (usually 1440 kB)::
dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would
have 2^15 + 2^14 + 400 = 49552::
rdev /dev/fd0 /dev/fd0
rdev -r /dev/fd0 49552
That is it. You now have your boot/root compressed RAM disk floppy. Some
users may wish to combine steps (d) and (f) by using a pipe.
Paul Gortmaker 12/95
Changelog:
----------
10-22-04 :
Updated to reflect changes in command line options, remove
obsolete references, general cleanup.
James Nelson (james4765@gmail.com)
12-95 :
Original Document

View File

@@ -0,0 +1,422 @@
========================================
zram: Compressed RAM based block devices
========================================
Introduction
============
The zram module creates RAM based block devices named /dev/zram<id>
(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
in memory itself. These disks allow very fast I/O and compression provides
good amounts of memory savings. Some of the usecases include /tmp storage,
use as swap disks, various caches under /var and maybe many more :)
Statistics for individual zram devices are exported through sysfs nodes at
/sys/block/zram<id>/
Usage
=====
There are several ways to configure and manage zram device(-s):
a) using zram and zram_control sysfs attributes
b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org).
In this document we will describe only 'manual' zram configuration steps,
IOW, zram and zram_control sysfs attributes.
In order to get a better idea about zramctl please consult util-linux
documentation, zramctl man-page or `zramctl --help`. Please be informed
that zram maintainers do not develop/maintain util-linux or zramctl; should
you have any questions please contact util-linux@vger.kernel.org.
Following shows a typical sequence of steps for using zram.
WARNING
=======
For the sake of simplicity we skip error checking parts in most of the
examples below. However, it is your sole responsibility to handle errors.
zram sysfs attributes always return negative values in case of errors.
The list of possible return codes:
======== =============================================================
-EBUSY an attempt to modify an attribute that cannot be changed once
the device has been initialised. Please reset device first;
-ENOMEM zram was not able to allocate enough memory to fulfil your
needs;
-EINVAL invalid input has been provided.
======== =============================================================
If you use 'echo', the returned value is that set by the 'echo' utility,
so, in the general case, something like::
echo 3 > /sys/block/zram0/max_comp_streams
if [ $? -ne 0 ]; then
handle_error
fi
should suffice.
1) Load Module
==============
::
modprobe zram num_devices=4
This creates 4 devices: /dev/zram{0,1,2,3}
num_devices parameter is optional and tells zram how many devices should be
pre-created. Default: 1.
2) Set max number of compression streams
========================================
Regardless of the value passed to this attribute, ZRAM will always
allocate multiple compression streams - one per online CPU - thus
allowing several concurrent compression operations. The number of
allocated compression streams goes down when some of the CPUs
become offline. There is no single-compression-stream mode anymore,
unless you are running a UP system or have only 1 CPU online.
To find out how many streams are currently available::
cat /sys/block/zram0/max_comp_streams
3) Select compression algorithm
===============================
Using the comp_algorithm device attribute one can see the available and
currently selected (shown in square brackets) compression algorithms,
and change the selected compression algorithm (once the device is
initialised there is no way to change it).
Examples::
#show supported compression algorithms
cat /sys/block/zram0/comp_algorithm
lzo [lz4]
#select lzo compression algorithm
echo lzo > /sys/block/zram0/comp_algorithm
For the time being, the `comp_algorithm` content does not necessarily
show every compression algorithm supported by the kernel. We keep this
list primarily to simplify device configuration and one can configure
a new device with a compression algorithm that is not listed in
`comp_algorithm`. The thing is that, internally, ZRAM uses the Crypto API
and, if some of the algorithms were built as modules, it's impossible
to list all of them using, for instance, /proc/crypto or any other
method. This, however, has the advantage of permitting the use of
custom crypto compression modules (implementing S/W or H/W compression).
4) Set Disksize
===============
Set disk size by writing the value to sysfs node 'disksize'.
The value can be either in bytes or you can use mem suffixes.
Examples::
# Initialize /dev/zram0 with 50MB disksize
echo $((50*1024*1024)) > /sys/block/zram0/disksize
# Using mem suffixes
echo 256K > /sys/block/zram0/disksize
echo 512M > /sys/block/zram0/disksize
echo 1G > /sys/block/zram0/disksize
Note:
There is little point in creating a zram greater than twice the size of memory
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.
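As a sketch, sizing zram0 to half of the installed RAM, following the
note above (MemTotal in /proc/meminfo is reported in kB)::
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo $(( mem_kb / 2 * 1024 )) > /sys/block/zram0/disksize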
5) Set memory limit: Optional
=============================
Set memory limit by writing the value to sysfs node 'mem_limit'.
The value can be either in bytes or you can use mem suffixes.
In addition, you can change the value at runtime.
Examples::
# limit /dev/zram0 with 50MB memory
echo $((50*1024*1024)) > /sys/block/zram0/mem_limit
# Using mem suffixes
echo 256K > /sys/block/zram0/mem_limit
echo 512M > /sys/block/zram0/mem_limit
echo 1G > /sys/block/zram0/mem_limit
# To disable memory limit
echo 0 > /sys/block/zram0/mem_limit
6) Activate
===========
::
mkswap /dev/zram0
swapon /dev/zram0
mkfs.ext4 /dev/zram1
mount /dev/zram1 /tmp
7) Add/remove zram devices
==========================
zram provides a control interface, which enables dynamic (on-demand) device
addition and removal.
In order to add a new /dev/zramX device, perform a read operation on the
hot_add attribute. This will return either the new device's device id
(meaning that you can use /dev/zram<id>) or an error code.
Example::
cat /sys/class/zram-control/hot_add
1
To remove the existing /dev/zramX device (where X is a device id)
execute::
echo X > /sys/class/zram-control/hot_remove
8) Stats
========
Per-device statistics are exported as various nodes under /sys/block/zram<id>/
A brief description of exported device attributes. For more details please
read Documentation/ABI/testing/sysfs-block-zram.
====================== ====== ===============================================
Name access description
====================== ====== ===============================================
disksize RW show and set the device's disk size
initstate RO shows the initialization state of the device
reset WO trigger device reset
mem_used_max WO reset the `mem_used_max` counter (see later)
mem_limit WO specifies the maximum amount of memory ZRAM can
use to store the compressed data
writeback_limit WO specifies the maximum amount of write IO zram
can write out to backing device as 4KB unit
writeback_limit_enable RW show and set writeback_limit feature
max_comp_streams RW the number of possible concurrent compress
operations
comp_algorithm RW show and change the compression algorithm
compact WO trigger memory compaction
debug_stat RO this file is used for zram debugging purposes
backing_dev RW set up backend storage for zram to write out
idle WO mark allocated slot as idle
====================== ====== ===============================================
User space is advised to use the following files to read the device statistics.
File /sys/block/zram<id>/stat
Represents block layer statistics. Read Documentation/block/stat.rst for
details.
File /sys/block/zram<id>/io_stat
The stat file represents device's I/O statistics not accounted by block
layer and, thus, not available in zram<id>/stat file. It consists of a
single line of text and contains the following stats separated by
whitespace:
============= =============================================================
failed_reads The number of failed reads
failed_writes The number of failed writes
invalid_io The number of non-page-size-aligned I/O requests
notify_free Depending on device usage scenario it may account
a) the number of pages freed because of swap slot free
notifications
b) the number of pages freed because of
REQ_OP_DISCARD requests sent by bio. The former ones are
sent to a swap block device when a swap slot is freed,
which implies that this disk is being used as a swap disk.
The latter ones are sent by filesystem mounted with
discard option, whenever some data blocks are getting
discarded.
============= =============================================================
File /sys/block/zram<id>/mm_stat
The stat file represents device's mm statistics. It consists of a single
line of text and contains the following stats separated by whitespace:
================ =============================================================
orig_data_size uncompressed size of data stored in this disk.
This excludes same-element-filled pages (same_pages) since
no memory is allocated for them.
Unit: bytes
compr_data_size compressed size of data stored in this disk
mem_used_total the amount of memory allocated for this disk. This
includes allocator fragmentation and metadata overhead,
allocated for this disk. So, allocator space efficiency
can be calculated using compr_data_size and this statistic.
Unit: bytes
mem_limit the maximum amount of memory ZRAM can use to store
the compressed data
mem_used_max the maximum amount of memory zram has consumed to
store the data
same_pages the number of same element filled pages written to this disk.
No memory is allocated for such pages.
pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages
================ =============================================================
File /sys/block/zram<id>/bd_stat
The stat file represents device's backing device statistics. It consists of
a single line of text and contains the following stats separated by whitespace:
============== =============================================================
bd_count size of data written in backing device.
Unit: 4K bytes
bd_reads the number of reads from backing device
Unit: 4K bytes
bd_writes the number of writes to backing device
Unit: 4K bytes
============== =============================================================
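As a sketch of consuming these flat stat files from a shell script (the
field order follows the mm_stat table above)::
read orig compr mem_used rest < /sys/block/zram0/mm_stat
echo "orig: $orig bytes, compr: $compr bytes, used: $mem_used bytes"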
9) Deactivate
=============
::
swapoff /dev/zram0
umount /dev/zram1
10) Reset
=========
Write any positive value to 'reset' sysfs node::
echo 1 > /sys/block/zram0/reset
echo 1 > /sys/block/zram1/reset
This frees all the memory allocated for the given device and
resets the disksize to zero. You must set the disksize again
before reusing the device.
Optional Feature
================
writeback
---------
With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible pages
to backing storage rather than keeping them in memory.
To use the feature, the admin should set up the backing device via::
echo /dev/sda5 > /sys/block/zramX/backing_dev
before setting the disksize. Only a partition is supported at the moment.
If the admin wants to use incompressible page writeback, they can do so via::
echo huge > /sys/block/zramX/writeback
To use idle page writeback, the user first needs to declare zram pages
as idle::
echo all > /sys/block/zramX/idle
From now on, all pages on zram are idle pages. The idle mark
will be removed when someone requests access to the block.
IOW, unless there is an access request, those pages remain idle pages.
The admin can request writeback of those idle pages at the right timing via::
echo idle > /sys/block/zramX/writeback
With this command, zram writes idle pages back from memory to the storage.
If there is lots of write IO to a flash device, it can potentially cause a
flash wearout problem, so the admin needs to design a write limitation
to guarantee storage health for the entire product life.
To address this concern, zram supports the "writeback_limit" feature.
The default value of "writeback_limit_enable" is 0, so it doesn't limit
any writeback. IOW, if the admin wants to apply a writeback budget, they
should enable writeback_limit_enable via::
$ echo 1 > /sys/block/zramX/writeback_limit_enable
Once writeback_limit_enable is set, zram doesn't allow any writeback
until the admin sets the budget via /sys/block/zramX/writeback_limit.
(If the admin doesn't enable writeback_limit_enable, the value
assigned via /sys/block/zramX/writeback_limit is meaningless.)
If the admin wants to limit writeback to, e.g., 400M per day, they could
do it like below::
$ MB_SHIFT=20
$ PAGE_SHIFT=12
$ echo $((400<<MB_SHIFT>>PAGE_SHIFT)) > \
/sys/block/zram0/writeback_limit
$ echo 1 > /sys/block/zram0/writeback_limit_enable
If the admin wants to allow further writeback again once the budget is
exhausted, they could do it like below::
$ echo $((400<<MB_SHIFT>>PAGE_SHIFT)) > \
/sys/block/zram0/writeback_limit
If the admin wants to see the remaining writeback budget since it was set::
$ cat /sys/block/zramX/writeback_limit
If the admin wants to disable the writeback limit, they can do::
$ echo 0 > /sys/block/zramX/writeback_limit_enable
The writeback_limit count will reset whenever you reset zram (e.g., on
system reboot or echo 1 > /sys/block/zramX/reset), so keeping track of
how much writeback happened before the reset, in order to allocate extra
writeback budget on the next setting, is the user's job.
If the admin wants to measure the writeback count over a certain period,
they can read it from the 3rd column of /sys/block/zram0/bd_stat.
memory tracking
===============
With CONFIG_ZRAM_MEMORY_TRACKING, the user can get information about each
zram block. It could be useful to catch cold or incompressible
pages of a process with *pagemap*.
If you enable the feature, you can see the block state via
/sys/kernel/debug/zram/zram0/block_state. The output is as follows::
300 75.033841 .wh.
301 63.806904 s...
302 63.806919 ..hi
First column
zram's block index.
Second column
access time since the system was booted
Third column
state of the block:
s:
same page
w:
written page to backing store
h:
huge page
i:
idle page
The first line of the above example says the 300th block was accessed at
75.033841 sec; the block's state is huge and it has been written back to
the backing storage. This is a debugging feature, so nobody should rely
on it to work properly.
Nitin Gupta
ngupta@vflare.org

View File

@@ -0,0 +1,124 @@
=============
btmrvl driver
=============
All commands are used via debugfs interface.
Set/get driver configurations
=============================
Path: /debug/btmrvl/config/
gpiogap=[n], hscfgcmd
These commands are used to configure the host sleep parameters::
bit 8:0 -- Gap
bit 16:8 -- GPIO
where GPIO is the pin number of GPIO used to wake up the host.
It could be any valid GPIO pin# (e.g. 0-7) or 0xff (SDIO interface
wakeup will be used instead).
where Gap is the gap in milliseconds between the wakeup signal and
the wakeup event, or 0xff for the special host sleep setting.
Usage::
# Use SDIO interface to wake up the host and set GAP to 0x80:
echo 0xff80 > /debug/btmrvl/config/gpiogap
echo 1 > /debug/btmrvl/config/hscfgcmd
# Use GPIO pin #3 to wake up the host and set GAP to 0xff:
echo 0x03ff > /debug/btmrvl/config/gpiogap
echo 1 > /debug/btmrvl/config/hscfgcmd
psmode=[n], pscmd
These commands are used to enable/disable auto sleep mode
where the option is::
1 -- Enable auto sleep mode
0 -- Disable auto sleep mode
Usage::
# Enable auto sleep mode
echo 1 > /debug/btmrvl/config/psmode
echo 1 > /debug/btmrvl/config/pscmd
# Disable auto sleep mode
echo 0 > /debug/btmrvl/config/psmode
echo 1 > /debug/btmrvl/config/pscmd
hsmode=[n], hscmd
These commands are used to enable host sleep or wake up firmware
where the option is::
1 -- Enable host sleep
0 -- Wake up firmware
Usage::
# Enable host sleep
echo 1 > /debug/btmrvl/config/hsmode
echo 1 > /debug/btmrvl/config/hscmd
# Wake up firmware
echo 0 > /debug/btmrvl/config/hsmode
echo 1 > /debug/btmrvl/config/hscmd
Get driver status
=================
Path: /debug/btmrvl/status/
Usage::
cat /debug/btmrvl/status/<args>
where the args are:
curpsmode
This command displays current auto sleep status.
psstate
This command displays the power save state.
hsstate
This command displays the host sleep state.
txdnldrdy
This command displays the value of Tx download ready flag.
Issuing a raw hci command
=========================
Use hcitool to issue a raw hci command; refer to the hcitool manual.
Usage::
hcitool cmd <ogf> <ocf> [Parameters]
Interface Control Command::
hcitool cmd 0x3f 0x5b 0xf5 0x01 0x00 --Enable All interface
hcitool cmd 0x3f 0x5b 0xf5 0x01 0x01 --Enable Wlan interface
hcitool cmd 0x3f 0x5b 0xf5 0x01 0x02 --Enable BT interface
hcitool cmd 0x3f 0x5b 0xf5 0x00 0x00 --Disable All interface
hcitool cmd 0x3f 0x5b 0xf5 0x00 0x01 --Disable Wlan interface
hcitool cmd 0x3f 0x5b 0xf5 0x00 0x02 --Disable BT interface
SD8688 firmware
===============
Images:
- /lib/firmware/sd8688_helper.bin
- /lib/firmware/sd8688.bin
The images can be downloaded from:
git.infradead.org/users/dwmw2/linux-firmware.git/libertas/

View File

@@ -90,9 +90,9 @@ the disk is not available then you have three options:
run a null modem to a second machine and capture the output there
using your favourite communication program. Minicom works well.
(3) Use Kdump (see Documentation/kdump/kdump.txt),
(3) Use Kdump (see Documentation/admin-guide/kdump/kdump.rst),
extract the kernel ring buffer from old memory with using dmesg
gdbmacro in Documentation/kdump/gdbmacros.txt.
gdbmacro in Documentation/admin-guide/kdump/gdbmacros.txt.
Finding the bug's location
--------------------------

View File

@@ -0,0 +1,302 @@
===================
Block IO Controller
===================
Overview
========
cgroup subsys "blkio" implements the block io controller. There seems to be
a need for various kinds of IO control policies (like proportional BW, max BW)
both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
The plan is to use the same cgroup based management interface for the blkio
controller and, based on user options, switch IO policies in the background.
One IO control policy is throttling policy which can be used to
specify upper IO rate limits on devices. This policy is implemented in
generic block layer and can be used on leaf nodes as well as higher
level logical devices like device mapper.
HOWTO
=====
Throttling/Upper Limit policy
-----------------------------
- Enable Block IO controller::
CONFIG_BLK_CGROUP=y
- Enable throttling in block layer::
CONFIG_BLK_DEV_THROTTLING=y
- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
- Specify a bandwidth rate on particular device for root group. The format
for policy is "<major>:<minor> <bytes_per_second>"::
echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
Above will put a limit of 1MB/second on reads happening for root group
on device having major/minor number 8:16.
- Run dd to read a file and see if rate is throttled to 1MB/s or not::
# dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
Limits for writes can be put using blkio.throttle.write_bps_device file.
Hierarchical Cgroups
====================
Throttling implements hierarchy support; however,
throttling's hierarchy support is enabled iff "sane_behavior" is
enabled from cgroup side, which currently is a development option and
not publicly available.
If somebody created a hierarchy as follows::
root
/ \
test1 test2
|
test3
Throttling with "sane_behavior" will handle the
hierarchy correctly. For throttling, all limits apply
to the whole subtree while all statistics are local to the IOs
directly generated by tasks in that cgroup.
Throttling without "sane_behavior" enabled from the cgroup side will
practically treat all groups at the same level, as if the hierarchy
looked like the following::
pivot
/ / \ \
root test1 test2 test3
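In terms of interfaces, such a hierarchy is just nested directories under
the mounted controller. A sketch, assuming the mount point from the HOWTO
above (test3 is created under test1 here; the mechanics are the same
wherever it nests)::
mkdir /sys/fs/cgroup/blkio/test1
mkdir /sys/fs/cgroup/blkio/test2
mkdir /sys/fs/cgroup/blkio/test1/test3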
Various user visible config options
===================================
CONFIG_BLK_CGROUP
- Block IO controller.
CONFIG_BFQ_CGROUP_DEBUG
- Debug help. Right now some additional stats file show up in cgroup
if this option is enabled.
CONFIG_BLK_DEV_THROTTLING
- Enable block device throttling support in block layer.
Details of cgroup files
=======================
Proportional weight policy files
--------------------------------
- blkio.weight
- Specifies per cgroup weight. This is default weight of the group
on all the devices until and unless overridden by per device rule.
(See blkio.weight_device).
Currently allowed range of weights is from 10 to 1000.
- blkio.weight_device
- One can specify per cgroup per device rules using this interface.
These rules override the default value of group weight as specified
by blkio.weight.
Following is the format::
# echo dev_maj:dev_minor weight > blkio.weight_device
Configure weight=300 on /dev/sdb (8:16) in this cgroup::
# echo 8:16 300 > blkio.weight_device
# cat blkio.weight_device
dev weight
8:16 300
Configure weight=500 on /dev/sda (8:0) in this cgroup::
# echo 8:0 500 > blkio.weight_device
# cat blkio.weight_device
dev weight
8:0 500
8:16 300
Remove specific weight for /dev/sda in this cgroup::
# echo 8:0 0 > blkio.weight_device
# cat blkio.weight_device
dev weight
8:16 300
- blkio.leaf_weight[_device]
- Equivalents of blkio.weight[_device] for the purpose of
deciding how much weight tasks in the given cgroup have while
competing with the cgroup's child cgroups. For details,
please refer to Documentation/block/cfq-iosched.txt.
- blkio.time
- disk time allocated to cgroup per device in milliseconds. First
two fields specify the major and minor number of the device and
third field specifies the disk time allocated to group in
milliseconds.
- blkio.sectors
- number of sectors transferred to/from disk by the group. First
two fields specify the major and minor number of the device and
third field specifies the number of sectors transferred by the
group to/from the device.
- blkio.io_service_bytes
- Number of bytes transferred to/from the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
specifies the number of bytes.
- blkio.io_serviced
- Number of IOs (bio) issued to the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
specifies the number of IOs.
- blkio.io_service_time
- Total amount of time between request dispatch and request completion
for the IOs done by this cgroup. This is in nanoseconds to make it
meaningful for flash devices too. For devices with queue depth of 1,
this time represents the actual service time. When queue_depth > 1,
that is no longer true as requests may be served out of order. This
may cause the service time for a given IO to include the service time
of multiple IOs when served out of order which may result in total
io_service_time > actual time elapsed. This time is further divided by
the type of operation - read or write, sync or async. First two fields
specify the major and minor number of the device, third field
specifies the operation type and the fourth field specifies the
io_service_time in ns.
- blkio.io_wait_time
- Total amount of time the IOs for this cgroup spent waiting in the
scheduler queues for service. This can be greater than the total time
elapsed since it is cumulative io_wait_time for all IOs. It is not a
measure of total time the cgroup spent waiting but rather a measure of
the wait_time for its individual IOs. For devices with queue_depth > 1
this metric does not include the time spent waiting for service once
the IO is dispatched to the device but till it actually gets serviced
(there might be a time lag here due to re-ordering of requests by the
device). This is in nanoseconds to make it meaningful for flash
devices too. This time is further divided by the type of operation -
read or write, sync or async. First two fields specify the major and
minor number of the device, third field specifies the operation type
and the fourth field specifies the io_wait_time in ns.
- blkio.io_merged
- Total number of bios/requests merged into requests belonging to this
cgroup. This is further divided by the type of operation - read or
write, sync or async.
- blkio.io_queued
- Total number of requests queued up at any given instant for this
cgroup. This is further divided by the type of operation - read or
write, sync or async.
- blkio.avg_queue_size
- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
The average queue size for this cgroup over the entire time of this
cgroup's existence. Queue size samples are taken each time one of the
queues of this cgroup gets a timeslice.
- blkio.group_wait_time
- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time the cgroup had to wait since it became busy
(i.e., went from 0 to 1 request queued) to get a timeslice for one of
its queues. This is different from the io_wait_time which is the
cumulative total of the amount of time spent by each IO in that cgroup
waiting in the scheduler queue. This is in nanoseconds. If this is
read when the cgroup is in a waiting (for timeslice) state, the stat
will only report the group_wait_time accumulated till the last time it
got a timeslice and will not include the current delta.
- blkio.empty_time
- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time a cgroup spends without any pending
requests when not being served, i.e., it does not include any time
spent idling for one of the queues of the cgroup. This is in
nanoseconds. If this is read when the cgroup is in an empty state,
the stat will only report the empty_time accumulated till the last
time it had a pending request and will not include the current delta.
- blkio.idle_time
- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
This is the amount of time spent by the IO scheduler idling for a
given cgroup in anticipation of a better request than the existing ones
from other queues/cgroups. This is in nanoseconds. If this is read
when the cgroup is in an idling state, the stat will only report the
idle_time accumulated till the last idle period and will not include
the current delta.
- blkio.dequeue
- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
gives the statistics about how many times a group was dequeued
from service tree of the device. First two fields specify the major
and minor number of the device and third field specifies the number
of times a group was dequeued from a particular device.
- blkio.*_recursive
- Recursive version of various stats. These files show the
same information as their non-recursive counterparts but
include stats from all the descendant cgroups.
Throttling/Upper limit policy files
-----------------------------------
- blkio.throttle.read_bps_device
- Specifies upper limit on READ rate from the device. IO rate is
specified in bytes per second. Rules are per device. Following is
the format::
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
- blkio.throttle.write_bps_device
- Specifies upper limit on WRITE rate to the device. IO rate is
specified in bytes per second. Rules are per device. Following is
the format::
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
- blkio.throttle.read_iops_device
- Specifies upper limit on READ rate from the device. IO rate is
specified in IO per second. Rules are per device. Following is
the format::
echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
- blkio.throttle.write_iops_device
- Specifies upper limit on WRITE rate to the device. IO rate is
specified in io per second. Rules are per device. Following is
the format::
echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
Note: If both BW and IOPS rules are specified for a device, then IO is
subject to both constraints.
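For example (illustrative values), to cap reads from device 8:16 at both
1MB/s and 100 IOPS::

    # echo "8:16 1048576" > blkio.throttle.read_bps_device
    # echo "8:16 100" > blkio.throttle.read_iops_device

Whichever limit is reached first throttles the IO.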
- blkio.throttle.io_serviced
- Number of IOs (bio) issued to the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
specifies the number of IOs.
- blkio.throttle.io_service_bytes
- Number of bytes transferred to/from the disk by the group. These
are further divided by the type of operation - read or write, sync
or async. First two fields specify the major and minor number of the
device, third field specifies the operation type and the fourth field
specifies the number of bytes.
Common files among various policies
-----------------------------------
- blkio.reset_stats
- Writing an int to this file will result in resetting all the stats
for that cgroup.
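  For example::

      # echo 1 > blkio.reset_stats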
View File
@@ -0,0 +1,695 @@
==============
Control Groups
==============
Written by Paul Menage <menage@google.com> based on
Documentation/admin-guide/cgroup-v1/cpusets.rst
Original copyright statements from cpusets.txt:
Portions Copyright (C) 2004 BULL SA.
Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
Modified by Paul Jackson <pj@sgi.com>
Modified by Christoph Lameter <cl@linux.com>
.. CONTENTS:
1. Control Groups
1.1 What are cgroups ?
1.2 Why are cgroups needed ?
1.3 How are cgroups implemented ?
1.4 What does notify_on_release do ?
1.5 What does clone_children do ?
1.6 How do I use cgroups ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Attaching processes
2.3 Mounting hierarchies by name
3. Kernel API
3.1 Overview
3.2 Synchronization
3.3 Subsystem API
4. Extended attribute usage
5. Questions
1. Control Groups
=================
1.1 What are cgroups ?
----------------------
Control Groups provide a mechanism for aggregating/partitioning sets of
tasks, and all their future children, into hierarchical groups with
specialized behaviour.
Definitions:
A *cgroup* associates a set of tasks with a set of parameters for one
or more subsystems.
A *subsystem* is a module that makes use of the task grouping
facilities provided by cgroups to treat groups of tasks in
particular ways. A subsystem is typically a "resource controller" that
schedules a resource or applies per-cgroup limits, but it may be
anything that wants to act on a group of processes, e.g. a
virtualization subsystem.
A *hierarchy* is a set of cgroups arranged in a tree, such that
every task in the system is in exactly one of the cgroups in the
hierarchy, and a set of subsystems; each subsystem has system-specific
state attached to each cgroup in the hierarchy. Each hierarchy has
an instance of the cgroup virtual filesystem associated with it.
At any one time there may be multiple active hierarchies of task
cgroups. Each hierarchy is a partition of all tasks in the system.
User-level code may create and destroy cgroups by name in an
instance of the cgroup virtual file system, specify and query to
which cgroup a task is assigned, and list the task PIDs assigned to
a cgroup. Those creations and assignments only affect the hierarchy
associated with that instance of the cgroup file system.
On their own, the only use for cgroups is for simple job
tracking. The intention is that other subsystems hook into the generic
cgroup support to provide new attributes for cgroups, such as
accounting/limiting the resources which processes in a cgroup can
access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rst) allow
you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup.
1.2 Why are cgroups needed ?
----------------------------
There are multiple efforts to provide process aggregations in the
Linux kernel, mainly for resource-tracking purposes. Such efforts
include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
namespaces. These all require the basic notion of a
grouping/partitioning of processes, with newly forked processes ending
up in the same group (cgroup) as their parent process.
The kernel cgroup patch provides the minimum essential kernel
mechanisms required to efficiently implement such groups. It has
minimal impact on the system fast paths, and provides hooks for
specific subsystems such as cpusets to provide additional behaviour as
desired.
Multiple hierarchy support is provided to allow for situations where
the division of tasks into cgroups is distinctly different for
different subsystems - having parallel hierarchies allows each
hierarchy to be a natural division of tasks, without having to handle
complex combinations of tasks that would be present if several
unrelated subsystems needed to be forced into the same tree of
cgroups.
At one extreme, each resource controller or subsystem could be in a
separate hierarchy; at the other extreme, all subsystems
would be attached to the same hierarchy.
As an example of a scenario (originally proposed by vatsa@in.ibm.com)
that can benefit from multiple hierarchies, consider a large
university server with various users - students, professors, system
tasks etc. The resource planning for this server could be along the
following lines::
CPU : "Top cpuset"
/ \
CPUSet1 CPUSet2
| |
(Professors) (Students)
In addition (system tasks) are attached to topcpuset (so
that they can run anywhere) with a limit of 20%
Memory : Professors (50%), Students (30%), system (20%)
Disk : Professors (50%), Students (30%), system (20%)
Network : WWW browsing (20%), Network File System (60%), others (20%)
/ \
Professors (15%) students (5%)
Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes
into the NFS network class.
At the same time Firefox/Lynx will share an appropriate CPU/Memory class
depending on who launched it (prof/student).
With the ability to classify tasks differently for different resources
(by putting those resource subsystems in different hierarchies),
the admin can easily set up a script which receives exec notifications
and, depending on who is launching the browser, can do::
# echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
With only a single hierarchy, the admin would potentially have to create
a separate cgroup for every browser launched and associate it with the
appropriate network and other resource classes. This may lead to a
proliferation of such cgroups.
Also let's say that the administrator would like to give enhanced network
access temporarily to a student's browser (since it is night and the user
wants to do online gaming :)) OR give one of the student's simulation
apps enhanced CPU power.
With the ability to write PIDs directly to resource classes, it's just a
matter of::
# echo pid > /sys/fs/cgroup/network/<new_class>/tasks
(after some time)
# echo pid > /sys/fs/cgroup/network/<orig_class>/tasks
Without this ability, the administrator would have to split the cgroup into
multiple separate ones and then associate the new cgroups with the
new resource classes.
1.3 How are cgroups implemented ?
---------------------------------
Control Groups extends the kernel as follows:
- Each task in the system has a reference-counted pointer to a
css_set.
- A css_set contains a set of reference-counted pointers to
cgroup_subsys_state objects, one for each cgroup subsystem
registered in the system. There is no direct link from a task to
the cgroup of which it's a member in each hierarchy, but this
can be determined by following pointers through the
cgroup_subsys_state objects. This is because accessing the
subsystem state is something that's expected to happen frequently
and in performance-critical code, whereas operations that require a
task's actual cgroup assignments (in particular, moving between
cgroups) are less common. A linked list runs through the cg_list
field of each task_struct using the css_set, anchored at
css_set->tasks.
- A cgroup hierarchy filesystem can be mounted for browsing and
manipulation from user space.
- You can list all the tasks (by PID) attached to any cgroup.
The implementation of cgroups requires a few, simple hooks
into the rest of the kernel, none in performance-critical paths:
- in init/main.c, to initialize the root cgroups and initial
css_set at system boot.
- in fork and exit, to attach and detach a task from its css_set.
In addition, a new file system of type "cgroup" may be mounted, to
enable browsing and modifying the cgroups presently known to the
kernel. When mounting a cgroup hierarchy, you may specify a
comma-separated list of subsystems to mount as the filesystem mount
options. By default, mounting the cgroup filesystem attempts to
mount a hierarchy containing all registered subsystems.
If an active hierarchy with exactly the same set of subsystems already
exists, it will be reused for the new mount. If no existing hierarchy
matches, and any of the requested subsystems are in use in an existing
hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
is activated, associated with the requested subsystems.
It's not currently possible to bind a new subsystem to an active
cgroup hierarchy, or to unbind a subsystem from an active cgroup
hierarchy. This may be possible in future, but is fraught with nasty
error-recovery issues.
When a cgroup filesystem is unmounted, if there are any
child cgroups created below the top-level cgroup, that hierarchy
will remain active even though unmounted; if there are no
child cgroups then the hierarchy will be deactivated.
No new system calls are added for cgroups - all support for
querying and modifying cgroups is via this cgroup file system.
Each task under /proc has an added file named 'cgroup' displaying,
for each active hierarchy, the subsystem names and the cgroup name
as the path relative to the root of the cgroup file system.
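As an illustrative sketch (hierarchy IDs and paths vary by system), each
line has the form hierarchy-id:subsystem-list:cgroup-path::

    # cat /proc/self/cgroup
    2:cpuset:/Charlie
    1:cpu,memory:/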
Each cgroup is represented by a directory in the cgroup file system
containing the following files describing that cgroup:
- tasks: list of tasks (by PID) attached to that cgroup. This list
is not guaranteed to be sorted. Writing a thread ID into this file
moves the thread into this cgroup.
- cgroup.procs: list of thread group IDs in the cgroup. This list is
not guaranteed to be sorted or free of duplicate TGIDs, and userspace
should sort/uniquify the list if this property is required.
Writing a thread group ID into this file moves all threads in that
group into this cgroup.
- notify_on_release flag: run the release agent on exit?
- release_agent: the path to use for release notifications (this file
exists in the top cgroup only)
Other subsystems such as cpusets may add additional files in each
cgroup dir.
New cgroups are created using the mkdir system call or shell
command. The properties of a cgroup, such as its flags, are
modified by writing to the appropriate file in that cgroups
directory, as listed above.
The named hierarchical structure of nested cgroups allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".
The attachment of each task, automatically inherited at fork by any
children of that task, to a cgroup allows organizing the work load
on a system into related sets of tasks. A task may be re-attached to
any other cgroup, if allowed by the permissions on the necessary
cgroup file system directories.
When a task is moved from one cgroup to another, it gets a new
css_set pointer - if there's an already existing css_set with the
desired collection of cgroups then that group is reused, otherwise a new
css_set is allocated. The appropriate existing css_set is located by
looking into a hash table.
To allow access from a cgroup to the css_sets (and hence tasks)
that comprise it, a set of cg_cgroup_link objects form a lattice;
each cg_cgroup_link is linked into a list of cg_cgroup_links for
a single cgroup on its cgrp_link_list field, and a list of
cg_cgroup_links for a single css_set on its cg_link_list.
Thus the set of tasks in a cgroup can be listed by iterating over
each css_set that references the cgroup, and sub-iterating over
each css_set's task set.
The use of a Linux virtual file system (vfs) to represent the
cgroup hierarchy provides for a familiar permission and name space
for cgroups, with a minimum of additional kernel code.
1.4 What does notify_on_release do ?
------------------------------------
If the notify_on_release flag is enabled (1) in a cgroup, then
whenever the last task in the cgroup leaves (exits or attaches to
some other cgroup) and the last child cgroup of that cgroup
is removed, then the kernel runs the command specified by the contents
of the "release_agent" file in that hierarchy's root directory,
supplying the pathname (relative to the mount point of the cgroup
file system) of the abandoned cgroup. This enables automatic
removal of abandoned cgroups. The default value of
notify_on_release in the root cgroup at system boot is disabled
(0). The default value of other cgroups at creation is the current
value of their parent's notify_on_release setting. The default value of
a cgroup hierarchy's release_agent path is empty.
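A minimal sketch of enabling automatic cleanup (the agent path
/sbin/cgroup_release_agent is a hypothetical example)::

    # echo "/sbin/cgroup_release_agent" > /sys/fs/cgroup/rg1/release_agent
    # echo 1 > /sys/fs/cgroup/rg1/my_cgroup/notify_on_release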
1.5 What does clone_children do ?
---------------------------------
This flag only affects the cpuset controller. If the clone_children
flag is enabled (1) in a cgroup, a new cpuset cgroup will copy its
configuration from the parent during initialization.
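For example, to have new child cpusets start out with a copy of their
parent's configuration::

    # echo 1 > /sys/fs/cgroup/cpuset/cgroup.clone_children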
1.6 How do I use cgroups ?
--------------------------
To start a new job that is to be contained within a cgroup, using
the "cpuset" cgroup subsystem, the steps are something like::
1) mount -t tmpfs cgroup_root /sys/fs/cgroup
2) mkdir /sys/fs/cgroup/cpuset
3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
4) Create the new cgroup by doing mkdir's and write's (or echo's) in
the /sys/fs/cgroup/cpuset virtual file system.
5) Start a task that will be the "founding father" of the new job.
6) Attach that task to the new cgroup by writing its PID to the
/sys/fs/cgroup/cpuset tasks file for that cgroup.
7) fork, exec or clone the job tasks from this founding father task.
For example, the following sequence of commands will setup a cgroup
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cgroup::
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/cpuset
mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
cd /sys/fs/cgroup/cpuset
mkdir Charlie
cd Charlie
/bin/echo 2-3 > cpuset.cpus
/bin/echo 1 > cpuset.mems
/bin/echo $$ > tasks
sh
# The subshell 'sh' is now running in cgroup Charlie
# The next line should display '/Charlie'
cat /proc/self/cgroup
2. Usage Examples and Syntax
============================
2.1 Basic Usage
---------------
Creating, modifying, using cgroups can be done through the cgroup
virtual filesystem.
To mount a cgroup hierarchy with all available subsystems, type::
# mount -t cgroup xxx /sys/fs/cgroup
The "xxx" is not interpreted by the cgroup code, but will appear in
/proc/mounts so may be any useful identifying string that you like.
Note: Some subsystems do not work without some user input first. For instance,
if cpusets are enabled the user will have to populate the cpus and mems files
for each new cgroup created before that group can be used.
As explained in section `1.2 Why are cgroups needed?` you should create
different hierarchies of cgroups for each single resource or group of
resources you want to control. Therefore, you should mount a tmpfs on
/sys/fs/cgroup and create directories for each cgroup resource or resource
group::
# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/rg1
To mount a cgroup hierarchy with just the cpuset and memory
subsystems, type::
# mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
While remounting cgroups is currently supported, it is not recommended.
Remounting allows changing the bound subsystems and
release_agent. Rebinding is hardly useful as it only works when the
hierarchy is empty and release_agent itself should be replaced with
conventional fsnotify. The support for remounting will be removed in
the future.
To specify a hierarchy's release_agent::
# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
xxx /sys/fs/cgroup/rg1
Note that specifying 'release_agent' more than once will return failure.
Note that changing the set of subsystems is currently only supported
when the hierarchy consists of a single (root) cgroup. Supporting
the ability to arbitrarily bind/unbind subsystems from an existing
cgroup hierarchy is intended to be implemented in the future.
Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
is the cgroup that holds the whole system.
If you want to change the value of release_agent::
# echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
It can also be changed via remount.
If you want to create a new cgroup under /sys/fs/cgroup/rg1::
# cd /sys/fs/cgroup/rg1
# mkdir my_cgroup
Now you want to do something with this cgroup::
# cd my_cgroup
In this directory you can find several files::
# ls
cgroup.procs notify_on_release tasks
(plus whatever files added by the attached subsystems)
Now attach your shell to this cgroup::
# /bin/echo $$ > tasks
You can also create cgroups inside your cgroup by using mkdir in this
directory::
# mkdir my_sub_cs
To remove a cgroup, just use rmdir::
# rmdir my_sub_cs
This will fail if the cgroup is in use (has cgroups inside, or
has processes attached, or is held alive by other subsystem-specific
reference).
2.2 Attaching processes
-----------------------
::
# /bin/echo PID > tasks
Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another::
# /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks
...
# /bin/echo PIDn > tasks
You can attach the current shell task by echoing 0::
# echo 0 > tasks
You can use the cgroup.procs file instead of the tasks file to move all
threads in a threadgroup at once. Echoing the PID of any task in a
threadgroup to cgroup.procs causes all tasks in that threadgroup to be
attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
in the writing task's threadgroup.
Note: Since every task is always a member of exactly one cgroup in each
mounted hierarchy, to remove a task from its current cgroup you must
move it into a new cgroup (possibly the root cgroup) by writing to the
new cgroup's tasks file.
Note: Due to some restrictions enforced by some cgroup subsystems, moving
a process to another cgroup can fail.
2.3 Mounting hierarchies by name
--------------------------------
Passing the name=<x> option when mounting a cgroups hierarchy
associates the given name with the hierarchy. This can be used when
mounting a pre-existing hierarchy, in order to refer to it by name
rather than by its set of active subsystems. Each hierarchy is either
nameless, or has a unique name.
The name should match [\w.-]+
When passing a name=<x> option for a new hierarchy, you need to
specify subsystems manually; the legacy behaviour of mounting all
subsystems when none are explicitly specified is not supported when
you give a subsystem a name.
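A sketch of mounting a named hierarchy (the name "mygrp" and the choice
of subsystem are examples)::

    # mkdir /sys/fs/cgroup/mygrp
    # mount -t cgroup -o cpuset,name=mygrp xxx /sys/fs/cgroup/mygrp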
The name of the subsystem appears as part of the hierarchy description
in /proc/mounts and /proc/<pid>/cgroups.
3. Kernel API
=============
3.1 Overview
------------
Each kernel subsystem that wants to hook into the generic cgroup
system needs to create a cgroup_subsys object. This contains
various methods, which are callbacks from the cgroup system, along
with a subsystem ID which will be assigned by the cgroup system.
Other fields in the cgroup_subsys object include:
- subsys_id: a unique array index for the subsystem, indicating which
entry in cgroup->subsys[] this subsystem should be managing.
- name: should be initialized to a unique subsystem name. Should be
no longer than MAX_CGROUP_TYPE_NAMELEN.
- early_init: indicate if the subsystem needs early initialization
at system boot.
Each cgroup object created by the system has an array of pointers,
indexed by subsystem ID; this pointer is entirely managed by the
subsystem; the generic cgroup code will never touch this pointer.
3.2 Synchronization
-------------------
There is a global mutex, cgroup_mutex, used by the cgroup
system. This should be taken by anything that wants to modify a
cgroup. It may also be taken to prevent cgroups from being
modified, but more specific locks may be more appropriate in that
situation.
See kernel/cgroup.c for more details.
Subsystems can take/release the cgroup_mutex via the functions
cgroup_lock()/cgroup_unlock().
Accessing a task's cgroup pointer may be done in the following ways:
- while holding cgroup_mutex
- while holding the task's alloc_lock (via task_lock())
- inside an rcu_read_lock() section via rcu_dereference()
3.3 Subsystem API
-----------------
Each subsystem should:
- add an entry in linux/cgroup_subsys.h
- define a cgroup_subsys object called <name>_cgrp_subsys
Each subsystem may export the following methods. The only mandatory
methods are css_alloc/free. Any others that are null are presumed to
be successful no-ops.
``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)``
(cgroup_mutex held by caller)
Called to allocate a subsystem state object for a cgroup. The
subsystem should allocate its subsystem state object for the passed
cgroup, returning a pointer to the new object on success or an
ERR_PTR() value. On success, the subsystem pointer should point to
a structure of type cgroup_subsys_state (typically embedded in a
larger subsystem-specific object), which will be initialized by the
cgroup system. Note that this will be called at initialization to
create the root subsystem state for this subsystem; this case can be
identified by the passed cgroup object having a NULL parent (since
it's the root of the hierarchy) and may be an appropriate place for
initialization code.
``int css_online(struct cgroup *cgrp)``
(cgroup_mutex held by caller)
Called after @cgrp has successfully completed all allocations and has
been made visible to cgroup_for_each_child/descendant_*() iterators. The
subsystem may choose to fail creation by returning -errno. This
callback can be used to implement reliable state sharing and
propagation along the hierarchy. See the comment on
cgroup_for_each_descendant_pre() for details.
``void css_offline(struct cgroup *cgrp);``
(cgroup_mutex held by caller)
This is the counterpart of css_online() and called iff css_online()
has succeeded on @cgrp. This signifies the beginning of the end of
@cgrp. @cgrp is being removed and the subsystem should start dropping
all references it's holding on @cgrp. When all references are dropped,
cgroup removal will proceed to the next step - css_free(). After this
callback, @cgrp should be considered dead to the subsystem.
``void css_free(struct cgroup *cgrp)``
(cgroup_mutex held by caller)
The cgroup system is about to free @cgrp; the subsystem should free
its subsystem state object. By the time this method is called, @cgrp
is completely unused; @cgrp->parent is still valid. (Note - can also
be called for a newly-created cgroup if an error occurs after this
subsystem's create() method has been called for the new cgroup).
``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
(cgroup_mutex held by caller)
Called prior to moving one or more tasks into a cgroup; if the
subsystem returns an error, this will abort the attach operation.
@tset contains the tasks to be attached and is guaranteed to have at
least one task in it.
If there are multiple tasks in the taskset, then:
- it's guaranteed that all are from the same thread group
- @tset contains all tasks from the thread group whether or not
they're switching cgroups
- the first task is the leader
Each @tset entry also contains the task's old cgroup and tasks which
aren't switching cgroup can be skipped easily using the
cgroup_taskset_for_each() iterator. Note that this isn't called on a
fork. If this method returns 0 (success) then this should remain valid
while the caller holds cgroup_mutex and it is ensured that either
attach() or cancel_attach() will be called in future.
``void css_reset(struct cgroup_subsys_state *css)``
(cgroup_mutex held by caller)
An optional operation which should restore @css's configuration to the
initial state. This is currently only used on the unified hierarchy
when a subsystem is disabled on a cgroup through
"cgroup.subtree_control" but should remain enabled because other
subsystems depend on it. cgroup core makes such a css invisible by
removing the associated interface files and invokes this callback so
that the hidden subsystem can return to the initial neutral state.
This prevents unexpected resource control from a hidden css and
ensures that the configuration is in the initial state when it is made
visible again later.
``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
(cgroup_mutex held by caller)
Called when a task attach operation has failed after can_attach() has succeeded.
A subsystem whose can_attach() has side effects should provide this
function so that it can implement a rollback; otherwise it is not
necessary. This will be called only for subsystems whose can_attach()
operation has succeeded. The parameters are identical to can_attach().
``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
(cgroup_mutex held by caller)
Called after the task has been attached to the cgroup, to allow any
post-attachment activity that requires memory allocations or blocking.
The parameters are identical to can_attach().
``void fork(struct task_struct *task)``
Called when a task is forked into a cgroup.
``void exit(struct task_struct *task)``
Called during task exit.
``void free(struct task_struct *task)``
Called when the task_struct is freed.
``void bind(struct cgroup *root)``
(cgroup_mutex held by caller)
Called when a cgroup subsystem is rebound to a different hierarchy
and root cgroup. Currently this will only involve movement between
the default hierarchy (which never has sub-cgroups) and a hierarchy
that is being created/destroyed (and hence has no sub-cgroups).
4. Extended attribute usage
===========================
The cgroup filesystem supports certain types of extended attributes in its
directories and files. The current supported types are:
- Trusted (XATTR_TRUSTED)
- Security (XATTR_SECURITY)
Both require CAP_SYS_ADMIN capability to set.
As in tmpfs, the extended attributes in the cgroup filesystem are stored
in kernel memory, and it's advised to keep usage to a minimum. This is
the reason why user-defined extended attributes are not supported, since
any user could set them and there is no limit on the value size.
The currently known users of this feature are SELinux, to limit cgroup
usage in containers, and systemd, for assorted metadata such as the main
PID in a cgroup (systemd creates a cgroup per service).
5. Questions
============
::
Q: what's up with this '/bin/echo' ?
A: bash's builtin 'echo' command does not check calls to write() against
errors. If you use it in the cgroup file system, you won't be
able to tell whether a command succeeded or failed.
Q: When I attach processes, only the first one on the line actually gets attached!
A: We can only return one error code per call to write(). So you should also
put only ONE PID.
View File
@@ -0,0 +1,50 @@
=========================
CPU Accounting Controller
=========================
The CPU accounting controller is used to group tasks using cgroups and
account the CPU usage of these groups of tasks.
The CPU accounting controller supports multi-hierarchy groups. An accounting
group accumulates the CPU usage of all of its child groups and the tasks
directly present in its group.
Accounting groups can be created by first mounting the cgroup filesystem::
# mount -t cgroup -ocpuacct none /sys/fs/cgroup
With the above step, the initial or the parent accounting group becomes
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained
by this group which is essentially the CPU time obtained by all the tasks
in the system.
New accounting groups can be created under the parent group /sys/fs/cgroup::
# cd /sys/fs/cgroup
# mkdir g1
# echo $$ > g1/tasks
The above steps create a new group g1 and move the current shell
process (bash) into it. CPU time consumed by this bash and its children
can be obtained from g1/cpuacct.usage, and the same is also accumulated
in /sys/fs/cgroup/cpuacct.usage.
The cpuacct.stat file lists a few statistics which further divide the
CPU time obtained by the cgroup into user and system times. Currently
the following statistics are supported:
user: Time spent by tasks of the cgroup in user mode.
system: Time spent by tasks of the cgroup in kernel mode.
user and system times are reported in units of USER_HZ.
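An illustrative read (the values are made up)::

    # cat g1/cpuacct.stat
    user 2763
    system 931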
The cpuacct controller uses the percpu_counter interface to collect user and
system times. This has two side effects:
- It is theoretically possible to see wrong values for user and system times.
This is because percpu_counter_read() on 32bit systems isn't safe
against concurrent writes.
- It is possible to see slightly outdated values for user and system times
due to the batch processing nature of percpu_counter.
View File
@@ -0,0 +1,866 @@
=======
CPUSETS
=======
Copyright (C) 2004 BULL SA.
Written by Simon.Derr@bull.net
- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
- Modified by Paul Jackson <pj@sgi.com>
- Modified by Christoph Lameter <cl@linux.com>
- Modified by Paul Menage <menage@google.com>
- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
.. CONTENTS:
1. Cpusets
1.1 What are cpusets ?
1.2 Why are cpusets needed ?
1.3 How are cpusets implemented ?
1.4 What are exclusive cpusets ?
1.5 What is memory_pressure ?
1.6 What is memory spread ?
1.7 What is sched_load_balance ?
1.8 What is sched_relax_domain_level ?
1.9 How do I use cpusets ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Adding/removing cpus
2.3 Setting flags
2.4 Attaching processes
3. Questions
4. Contact
1. Cpusets
==========
1.1 What are cpusets ?
----------------------
Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks. In this document "Memory Node" refers to
an on-line node that contains memory.
Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset. They form a nested
hierarchy visible in a virtual file system. These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.
Cpusets use the generic cgroup subsystem described in
Documentation/admin-guide/cgroup-v1/cgroups.rst.
Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset. The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.
User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.
1.2 Why are cpusets needed ?
----------------------------
The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.
Frequently, more modest-sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.
But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.
This can be especially valuable on:
* Web Servers running multiple instances of the same web application,
* Servers running different applications (for instance, a web server
and a database), or
* NUMA systems running large HPC applications with demanding
performance characteristics.
These subsets, or "soft partitions" must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs. The location of the running jobs' pages may also be moved
when the memory locations are changed.
The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets. It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.
1.3 How are cpusets implemented ?
---------------------------------
Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.
The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).
Cpusets extends these two mechanisms as follows:
- Cpusets are sets of allowed CPUs and Memory Nodes, known to the
kernel.
- Each task in the system is attached to a cpuset, via a pointer
in the task structure to a reference counted cgroup structure.
- Calls to sched_setaffinity are filtered to just those CPUs
allowed in that task's cpuset.
- Calls to mbind and set_mempolicy are filtered to just
those Memory Nodes allowed in that task's cpuset.
- The root cpuset contains all the system's CPUs and Memory
Nodes.
- For any cpuset, one can define child cpusets containing a subset
of the parent's CPU and Memory Node resources.
- The hierarchy of cpusets can be mounted at /dev/cpuset, for
browsing and manipulation from user space.
- A cpuset may be marked exclusive, which ensures that no other
cpuset (except direct ancestors and descendants) may contain
any overlapping CPUs or Memory Nodes.
- You can list all the tasks (by pid) attached to any cpuset.
The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:
- in init/main.c, to initialize the root cpuset at system boot.
- in fork and exit, to attach and detach a task from its cpuset.
- in sched_setaffinity, to mask the requested CPUs by what's
allowed in that task's cpuset.
- in sched.c migrate_live_tasks(), to keep migrating tasks within
the CPUs allowed by their cpuset, if possible.
- in the mbind and set_mempolicy system calls, to mask the requested
Memory Nodes by what's allowed in that task's cpuset.
- in page_alloc.c, to restrict memory to allowed nodes.
- in vmscan.c, to restrict page recovery to the current cpuset.
You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel. No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.
The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example::
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-127
Mems_allowed: ffffffff,ffffffff
Mems_allowed_list: 0-63
Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:
- cpuset.cpus: list of CPUs in that cpuset
- cpuset.mems: list of Memory Nodes in that cpuset
- cpuset.memory_migrate flag: if set, move pages to cpusets nodes
- cpuset.cpu_exclusive flag: is cpu placement exclusive?
- cpuset.mem_exclusive flag: is memory placement exclusive?
- cpuset.mem_hardwall flag: is memory allocation hardwalled
- cpuset.memory_pressure: measure of how much paging pressure in cpuset
- cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
- cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
- cpuset.sched_relax_domain_level: the searching range when migrating tasks
In addition, only the root cpuset has the following file:
- cpuset.memory_pressure_enabled flag: compute memory_pressure?
New cpusets are created using the mkdir system call or shell
command. The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpusets directory, as listed above.
The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".
The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset. A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.
Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.
The following rules apply to each cpuset:
- Its CPUs and Memory Nodes must be a subset of its parent's.
- It can't be marked exclusive unless its parent is.
- If its cpu or memory is exclusive, they may not overlap those of any sibling.
These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them changes, to ensure nothing overlaps an
exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.
The cpus and mems files in the root (top_cpuset) cpuset are
read-only. The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.
1.4 What are exclusive cpusets ?
--------------------------------
If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.
A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users. All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space. This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset. To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.
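As a sketch, a cpuset is marked exclusive by writing to its flag files
from inside that cpuset's directory::

    # /bin/echo 1 > cpuset.cpu_exclusive
    # /bin/echo 1 > cpuset.mem_exclusive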
1.5 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate at which the tasks in a cpuset are attempting to free up
in-use memory on the nodes of the cpuset to satisfy additional memory
requests.
This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.
This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure. It's up to the
batch manager or other user code to decide what to do about it and
take action.
==>
Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm,
the system load imposed by a batch scheduler monitoring this
metric is sharply reduced on large systems, because a scan of
the tasklist can be avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a
single read, instead of having to read and accumulate results
for a period of time.
Because this meter is per-cpuset rather than per-task or mm,
the batch scheduler can obtain the key information, memory
pressure in a cpuset, with a single read, rather than having to
query and accumulate results over all the (dynamically changing)
set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.
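A sketch of enabling the feature and sampling the metric (the value read
here is illustrative; a hierarchy mounted at /dev/cpuset is assumed, as
in the text above)::

    # echo 1 > /dev/cpuset/memory_pressure_enabled
    # cat /dev/cpuset/Charlie/memory_pressure
    0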
1.6 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related in
kernel data structures. They are called 'cpuset.memory_spread_page' and
'cpuset.memory_spread_slab'.
If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.
If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.
The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.
By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.
When new cpusets are created, they inherit the memory spread settings
of their parent.
Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing task's memory spread settings. If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.
Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files. By default they contain "0", meaning that the feature is off
for that cpuset. If a "1" is written to that file, then that turns
the named feature on.
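For example, from inside a cpuset's directory::

    # /bin/echo 1 > cpuset.memory_spread_page
    # cat cpuset.memory_spread_page
    1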
The implementation is simple.
Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset. The page allocation calls for the page cache
are modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.
Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().
The cpuset_mem_spread_node() routine is also simple. It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.
This memory placement policy is also known (in other contexts) as
round-robin or interleave.
This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the job's cpuset in order to fit. Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.
1.7 What is sched_load_balance ?
--------------------------------
The kernel scheduler (kernel/sched/core.c) automatically load balances
tasks. If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.
The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced. So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.
Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.
By default, there is one sched domain covering all CPUs, including those
marked isolated using the kernel boot time "isolcpus=" argument. However,
the isolated CPUs will not participate in load balancing, and will not
have tasks running on them unless explicitly assigned.
This default load balancing across all CPUs is not well suited for
the following two situations:
1) On large systems, load balancing across many CPUs is expensive.
If the system is managed using cpusets to place independent jobs
on separate sets of CPUs, full load balancing is unnecessary.
2) Systems supporting realtime on some CPUs need to minimize
system overhead on those CPUs, including avoiding task load
balancing if that is not needed.
When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.
When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.
So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.
Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets have this flag enabled.
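A sketch of that setup (paths follow the earlier cpuset examples):
disable balancing in the top cpuset and enable it only in the child
cpusets that should each form their own sched domain::

    # cd /sys/fs/cgroup/cpuset
    # /bin/echo 0 > cpuset.sched_load_balance
    # /bin/echo 1 > Charlie/cpuset.sched_load_balance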
When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets. Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.
Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.
There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest. Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.
It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding. So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both. We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.
This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration. If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.
If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above. In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.
CPUs in "cpuset.isolcpus" were excluded from load balancing by the
isolcpus= kernel boot option, and will never be load balanced regardless
of the value of "cpuset.sched_load_balance" in any cpuset.
1.7.1 sched_load_balance implementation details.
------------------------------------------------
The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.) When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)
If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.
If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.
The kernel commits to user space that it will avoid load balancing
where it can. It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system. This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.
The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:
- the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
- or CPUs come or go from a cpuset with this flag enabled,
- or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
and with this flag enabled changes,
- or a cpuset with non-empty CPUs and with this flag enabled is removed,
- or a cpu is offlined/onlined.
This partition exactly defines what sched domains the scheduler should
setup - one sched domain for each element (struct cpumask) in the
partition.
The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.
1.8 What is sched_relax_domain_level ?
--------------------------------------
Within a sched domain, the scheduler migrates tasks in two ways: periodic
load balancing on the tick, and migration at certain scheduling events.

When a task is woken up, the scheduler tries to move it to an idle CPU.
For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and idle,
then the scheduler migrates task B to CPU Y so that task B can start on
CPU Y without waiting for task A on CPU X.

And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
extra tasks from other busy CPUs to help them before it goes idle.

Of course, finding movable tasks and/or idle CPUs has some search cost,
so the scheduler might not search all CPUs in the domain every time.
In fact, on some architectures the search range for these events is
limited to the same socket or node where the CPU is located,
while the load balance on the tick searches all CPUs.

For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't migrate
the woken task B from X to Z since Z is out of its search range.
As a result, task B on CPU X needs to wait for task A, or for load
balancing on the next tick. For some applications in special situations,
waiting one tick may be too long.
The 'cpuset.sched_relax_domain_level' file allows you to request changing
this search range as you like. The file takes an integer value which
indicates the size of the search range in levels, ideally as follows;
otherwise the initial value -1 indicates that the cpuset has no request.
====== ===========================================================
-1 no request. use system default or follow request of others.
0 no search.
1 search siblings (hyperthreads in a core).
2 search cores in a package.
3 search cpus in a node [= system wide on non-NUMA system]
4 search nodes in a chunk of node [on NUMA system]
5 search system wide [on NUMA system]
====== ===========================================================
The system default is architecture dependent. The system default
can be changed using the relax_domain_level= boot parameter.
This file is per-cpuset and affects the sched domain to which the cpuset
belongs. Therefore, if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
there is no sched domain belonging to the cpuset.

If multiple cpusets overlap and hence form a single sched domain,
the largest value among them is used. Be careful: if one cpuset
requests 0 and the others request -1, then 0 is used.
Note that modifying this file has both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.

If your situation is:

- the migration cost between CPUs can be assumed to be considerably
  small (for you), due to your application's behavior or special
  hardware support for the CPU cache, and
- the search cost doesn't matter (for you), or you can keep the
  search cost small enough, e.g. by keeping your cpusets compact, and
- low latency is required even at the expense of cache hit rate etc.,

then increasing 'sched_relax_domain_level' would benefit you.
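As a minimal illustration (assuming a cpuset named "foo" under the usual
/sys/fs/cgroup/cpuset mount used throughout this document), the request is
inspected and changed like any other cpuset file::

    # cat /sys/fs/cgroup/cpuset/foo/cpuset.sched_relax_domain_level
    -1
    # /bin/echo 2 > /sys/fs/cgroup/cpuset/foo/cpuset.sched_relax_domain_level

Here 2 requests that the wakeup-time search cover all cores in the package,
per the table above.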
1.9 How do I use cpusets ?
--------------------------
In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.
If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement. If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset. If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
as queried by get_mempolicy(), doesn't change). If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.
If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately. Similarly,
if a task's pid is written to another cpuset's 'tasks' file, then its
allowed CPU placement is changed immediately. If such a task had been
bound to some subset of its cpuset using the sched_setaffinity() call,
the task will be allowed to run on any CPU allowed in its new cpuset,
negating the effect of the prior sched_setaffinity() call.
In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.
Normally, once a page is allocated (given a physical page
of main memory), that page stays on whatever node it
was allocated on, as long as it remains allocated, even if the
cpuset's memory placement policy 'cpuset.mems' subsequently changes.
If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset. The relative placement of the page within
the cpuset is preserved during these migration operations if possible.
For example if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset.
Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'mems.'
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'cpuset.mems' setting, will not be moved.
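A short sketch, using a hypothetical cpuset "foo" under the usual
/sys/fs/cgroup/cpuset mount::

    # cd /sys/fs/cgroup/cpuset/foo
    # /bin/echo 1 > cpuset.memory_migrate
    # /bin/echo 2 > cpuset.mems   # pages on the old nodes migrate to node 2

With 'cpuset.memory_migrate' enabled, the second write both changes the
allowed nodes and migrates the previously allocated pages.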
There is an exception to the above. If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus. But moving some (or all) tasks might fail if the
cpuset is bound to another cgroup subsystem which has restrictions
on task attachment. In this failing case, those tasks will stay
in the original cpuset, and the kernel will automatically update
their cpus_allowed to allow all online CPUs. When memory hotplug
functionality for removing Memory Nodes is available, a similar exception
is expected to apply there as well. In general, the kernel prefers to
violate cpuset placement, over starving a task that has had all
its allowed CPUs or Memory Nodes taken offline.
There is a second exception to the above. GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.
The kernel may drop some requests, in rare cases even panic, if a
GFP_ATOMIC alloc fails. If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it. It's better to violate the cpuset
than stress the kernel.
To start a new job that is to be contained within a cpuset, the steps are:
1) mkdir /sys/fs/cgroup/cpuset
2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
3) Create the new cpuset by doing mkdir's and write's (or echo's) in
the /sys/fs/cgroup/cpuset virtual file system.
4) Start a task that will be the "founding father" of the new job.
5) Attach that task to the new cpuset by writing its pid to the
/sys/fs/cgroup/cpuset tasks file for that cpuset.
6) fork, exec or clone the job tasks from this founding father task.
For example, the following sequence of commands will setup a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset::
mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
cd /sys/fs/cgroup/cpuset
mkdir Charlie
cd Charlie
/bin/echo 2-3 > cpuset.cpus
/bin/echo 1 > cpuset.mems
/bin/echo $$ > tasks
sh
# The subshell 'sh' is now running in cpuset Charlie
# The next line should display '/Charlie'
cat /proc/self/cpuset
There are ways to query or modify cpusets:
- via the cpuset file system directly, using the various cd, mkdir, echo,
cat, rmdir commands from the shell, or their equivalent from C.
- via the C library libcpuset.
- via the C library libcgroup.
(http://sourceforge.net/projects/libcg/)
- via the python application cset.
(http://code.google.com/p/cpuset/)
The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).
2. Usage Examples and Syntax
============================
2.1 Basic Usage
---------------
Creating, modifying, using the cpusets can be done through the cpuset
virtual filesystem.
To mount it, type::

  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
is the cpuset that holds the whole system.
If you want to create a new cpuset under /sys/fs/cgroup/cpuset::
# cd /sys/fs/cgroup/cpuset
# mkdir my_cpuset
Now you want to do something with this cpuset::
# cd my_cpuset
In this directory you can find several files::
# ls
cgroup.clone_children cpuset.memory_pressure
cgroup.event_control cpuset.memory_spread_page
cgroup.procs cpuset.memory_spread_slab
cpuset.cpu_exclusive cpuset.mems
cpuset.cpus cpuset.sched_load_balance
cpuset.mem_exclusive cpuset.sched_relax_domain_level
cpuset.mem_hardwall notify_on_release
cpuset.memory_migrate tasks
Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties. By writing to these files you can manipulate
the cpuset.
Set some flags::
# /bin/echo 1 > cpuset.cpu_exclusive
Add some cpus::
# /bin/echo 0-7 > cpuset.cpus
Add some mems::
# /bin/echo 0-7 > cpuset.mems
Now attach your shell to this cpuset::
# /bin/echo $$ > tasks
You can also create cpusets inside your cpuset by using mkdir in this
directory::
# mkdir my_sub_cs
To remove a cpuset, just use rmdir::
# rmdir my_sub_cs
This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).
Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.
The command::
mount -t cpuset X /sys/fs/cgroup/cpuset
is equivalent to::
mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
2.2 Adding/removing cpus
------------------------
This is the syntax to use when writing in the cpus or mems files
in cpuset directories::
# /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
# /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added. To add 6 to the above cpuset::
# /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6
Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.
To remove all the CPUs::
# /bin/echo "" > cpuset.cpus -> clear cpus list
2.3 Setting flags
-----------------
The syntax is very simple::
# /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive'
# /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive'
2.4 Attaching processes
-----------------------
::
# /bin/echo PID > tasks
Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another::
# /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks
...
# /bin/echo PIDn > tasks
3. Questions
============
Q:
What's up with this '/bin/echo'?
A:
bash's builtin 'echo' command does not check calls to write() against
errors. If you use it in the cpuset file system, you won't be
able to tell whether a command succeeded or failed.
Q:
When I attach processes, only the first one on the line really gets attached!
A:
We can only return one error code per call to write(). So you should also
put only ONE pid.
4. Contact
==========
Web: http://www.bullopensource.org/cpuset
View File
@@ -0,0 +1,132 @@
===========================
Device Whitelist Controller
===========================
1. Description
==============
Implement a cgroup to track and enforce open and mknod restrictions
on device files. A device cgroup associates a device access
whitelist with each cgroup. A whitelist entry has 4 fields.
'type' is a (all), c (char), or b (block). 'all' means it applies
to all types and all major and minor numbers. Major and minor are
either an integer or * for all. Access is a composition of r
(read), w (write), and m (mknod).
The root device cgroup starts with rwm to 'all'. A child device
cgroup gets a copy of the parent. Administrators can then remove
devices from the whitelist or add new entries. A child cgroup can
never receive a device access which is denied by its parent.
2. User Interface
=================
An entry is added using devices.allow, and removed using
devices.deny. For instance::
echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
allows cgroup 1 to read and mknod the device usually known as
/dev/null. Doing::
echo a > /sys/fs/cgroup/1/devices.deny
will remove the default 'a *:* rwm' entry. Doing::
echo a > /sys/fs/cgroup/1/devices.allow
will add the 'a *:* rwm' entry to the whitelist.
3. Security
===========
Any task can move itself between cgroups. This clearly won't
suffice, but we can decide the best way to adequately restrict
movement as people get some experience with this. We may just want
to require CAP_SYS_ADMIN, which at least is a separate bit from
CAP_MKNOD. We may want to just refuse moving to a cgroup which
isn't a descendant of the current one. Or we may want to use
CAP_MAC_ADMIN, since we really are trying to lock down root.
CAP_SYS_ADMIN is needed to modify the whitelist or move another
task to a new cgroup. (Again we'll probably want to change that).
A cgroup may not be granted more permissions than the cgroup's
parent has.
4. Hierarchy
============
device cgroups maintain hierarchy by making sure a cgroup never has more
access permissions than its parent. Every time an entry is written to
a cgroup's devices.deny file, all its children will have that entry removed
from their whitelist and all the locally set whitelist entries will be
re-evaluated. In case one of the locally set whitelist entries would provide
more access than the cgroup's parent, it'll be removed from the whitelist.
Example::

        A
       / \
      B

    group        behavior        exceptions
    A            allow           "b 8:* rwm", "c 116:1 rw"
    B            deny            "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"
If a device is denied in group A::
# echo "c 116:* r" > A/devices.deny
it'll propagate down and after revalidating B's entries, the whitelist entry
"c 116:2 rwm" will be removed::
    group        whitelist entries            denied devices
    A            all                          "b 8:* rwm", "c 116:* rw"
    B            "c 1:3 rwm", "b 3:* rwm"     all the rest
In case parent's exceptions change and local exceptions are not allowed
anymore, they'll be deleted.
Notice that new whitelist entries will not be propagated::

        A
       / \
      B

    group        whitelist entries          denied devices
    A            "c 1:3 rwm", "c 1:5 r"     all the rest
    B            "c 1:3 rwm", "c 1:5 r"     all the rest
when adding ``c *:3 rwm``::
# echo "c *:3 rwm" >A/devices.allow
the result::
    group        whitelist entries          denied devices
    A            "c *:3 rwm", "c 1:5 r"     all the rest
    B            "c 1:3 rwm", "c 1:5 r"     all the rest
but now it'll be possible to add new entries to B::
# echo "c 2:3 rwm" >B/devices.allow
# echo "c 50:3 r" >B/devices.allow
or even::
# echo "c *:3 rwm" >B/devices.allow
Allowing or denying all by writing 'a' to devices.allow or devices.deny will
not be possible once the device cgroup has children.
4.1 Hierarchy (internal implementation)
---------------------------------------
The device cgroup is implemented internally using a behavior (ALLOW, DENY)
and a list of exceptions. The internal state is controlled using the same
user interface to preserve compatibility with the previous whitelist-only
implementation. Removal or addition of exceptions that reduce the access
to devices will be propagated down the hierarchy.
For every propagated exception, the effective rules will be re-evaluated
based on the current parent's access rules.
View File
@@ -0,0 +1,127 @@
==============
Cgroup Freezer
==============
The cgroup freezer is useful for batch job management systems, which start
and stop sets of tasks in order to schedule a machine's resources according
to the desires of a system administrator. This sort of program is often used
on HPC clusters to schedule access to the cluster as a whole. The cgroup
freezer uses cgroups to describe the set of tasks to be started/stopped by
the batch job management system. It also provides a means to start and stop
the tasks composing the job.
The cgroup freezer will also be useful for checkpointing running groups
of tasks. The freezer allows the checkpoint code to obtain a consistent
image of the tasks by attempting to force the tasks in a cgroup into a
quiescent state. Once the tasks are quiescent another task can
walk /proc or invoke a kernel interface to gather information about the
quiesced tasks. Checkpointed tasks can be restarted later should a
recoverable error occur. This also allows the checkpointed tasks to be
migrated between nodes in a cluster by copying the gathered information
to another node and restarting the tasks there.
Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
and resuming tasks in userspace. Both of these signals are observable
from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
blocked, or ignored, it can be seen by waiting or ptracing parent tasks.
SIGCONT is especially unsuitable since it can be caught by the task. Any
programs designed to watch for SIGSTOP and SIGCONT could be broken by
attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
demonstrate this problem using nested bash shells::
$ echo $$
16644
$ bash
$ echo $$
16690
From a second, unrelated bash shell:
$ kill -SIGSTOP 16690
$ kill -SIGCONT 16690
<at this point 16690 exits and causes 16644 to exit too>
This happens because bash can observe both signals and choose how it
responds to them.
Another example of a program which catches and responds to these
signals is gdb. In fact any program designed to use ptrace is likely to
have a problem with this method of stopping and resuming tasks.
In contrast, the cgroup freezer uses the kernel freezer code to
prevent the freeze/unfreeze cycle from becoming visible to the tasks
being frozen. This allows the bash example above and gdb to run as
expected.
The cgroup freezer is hierarchical. Freezing a cgroup freezes all
tasks belonging to the cgroup and all its descendant cgroups. Each
cgroup has its own state (self-state) and the state inherited from the
parent (parent-state). Iff both states are THAWED, the cgroup is
THAWED.
The following cgroupfs files are created by the cgroup freezer.
* freezer.state: Read-write.
When read, returns the effective state of the cgroup - "THAWED",
"FREEZING" or "FROZEN". This is the combined self and parent-states.
If any is freezing, the cgroup is freezing (FREEZING or FROZEN).
FREEZING cgroup transitions into FROZEN state when all tasks
belonging to the cgroup and its descendants become frozen. Note that
a cgroup reverts to FREEZING from FROZEN after a new task is added
to the cgroup or one of its descendant cgroups until the new task is
frozen.
When written, sets the self-state of the cgroup. Two values are
allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup,
if not already freezing, enters FREEZING state along with all its
descendant cgroups.
If THAWED is written, the self-state of the cgroup is changed to
THAWED. Note that the effective state may not change to THAWED if
the parent-state is still freezing. If a cgroup's effective state
becomes THAWED, all its descendants which are freezing because of
the cgroup also leave the freezing state.
* freezer.self_freezing: Read only.
Shows the self-state. 0 if the self-state is THAWED; otherwise, 1.
This value is 1 iff the last write to freezer.state was "FROZEN".
* freezer.parent_freezing: Read only.
Shows the parent-state. 0 if none of the cgroup's ancestors is
frozen; otherwise, 1.
The root cgroup is non-freezable and the above interface files don't
exist.
* Examples of usage::
# mkdir /sys/fs/cgroup/freezer
# mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
# mkdir /sys/fs/cgroup/freezer/0
# echo $some_pid > /sys/fs/cgroup/freezer/0/tasks
to get status of the freezer subsystem::
# cat /sys/fs/cgroup/freezer/0/freezer.state
THAWED
to freeze all tasks in the container::
# echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
# cat /sys/fs/cgroup/freezer/0/freezer.state
FREEZING
# cat /sys/fs/cgroup/freezer/0/freezer.state
FROZEN
to unfreeze all tasks in the container::
# echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
# cat /sys/fs/cgroup/freezer/0/freezer.state
THAWED
This is the basic mechanism which should do the right thing for user space
tasks in a simple scenario.
View File
@@ -0,0 +1,50 @@
==================
HugeTLB Controller
==================
The HugeTLB controller allows limiting HugeTLB usage per control group and
enforces the controller limit during page fault. Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that
the application will get a SIGBUS signal if it tries to access HugeTLB pages
beyond its limit. This requires the application to know beforehand how many
HugeTLB pages it will need.

The HugeTLB controller can be created by first mounting the cgroup
filesystem::

  # mount -t cgroup -o hugetlb none /sys/fs/cgroup
With the above step, the initial or the parent HugeTLB group becomes
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
New groups can be created under the parent group /sys/fs/cgroup::
# cd /sys/fs/cgroup
# mkdir g1
# echo $$ > g1/tasks
The above steps create a new group g1 and move the current shell
process (bash) into it.
Brief summary of control files::
hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt             # show the number of allocation failures due to the HugeTLB limit
For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include::
hugetlb.1GB.limit_in_bytes
hugetlb.1GB.max_usage_in_bytes
hugetlb.1GB.usage_in_bytes
hugetlb.1GB.failcnt
hugetlb.64KB.limit_in_bytes
hugetlb.64KB.max_usage_in_bytes
hugetlb.64KB.usage_in_bytes
hugetlb.64KB.failcnt
hugetlb.32MB.limit_in_bytes
hugetlb.32MB.max_usage_in_bytes
hugetlb.32MB.usage_in_bytes
hugetlb.32MB.failcnt
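As an illustrative sketch, a limit can be set and usage read back in the g1
group created above (the 2MB size below is just the common x86_64 default;
use whatever sizes your system actually provides)::

  # echo 512M > /sys/fs/cgroup/g1/hugetlb.2MB.limit_in_bytes
  # cat /sys/fs/cgroup/g1/hugetlb.2MB.usage_in_bytes
  0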
View File
@@ -0,0 +1,28 @@
========================
Control Groups version 1
========================
.. toctree::
:maxdepth: 1
cgroups
blkio-controller
cpuacct
cpusets
devices
freezer-subsystem
hugetlb
memcg_test
memory
net_cls
net_prio
pids
rdma
.. only:: subproject and html
Indices
=======
* :ref:`genindex`
View File
@@ -0,0 +1,355 @@
=====================================================
Memory Resource Controller(Memcg) Implementation Memo
=====================================================
Last Updated: 2010/2
Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
Because the VM is getting complex (one of the reasons is memcg...), memcg's
behavior is complex too. This is a document describing memcg's internal
behavior. Please note that implementation details can be changed.
(*) Topics on API should be in Documentation/admin-guide/cgroup-v1/memory.rst)
0. How to record usage ?
========================
2 objects are used.
page_cgroup ....an object per page.
Allocated at boot or memory hotplug. Freed at memory hot removal.
swap_cgroup ... an entry per swp_entry.
Allocated at swapon(). Freed at swapoff().
The page_cgroup has a USED bit, and double counting against a page_cgroup
never occurs. swap_cgroup is used only when a charged page is swapped out.
1. Charge
=========
a page/swp_entry may be charged (usage += PAGE_SIZE) at
mem_cgroup_try_charge()
2. Uncharge
===========
a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
mem_cgroup_uncharge()
Called when a page's refcount goes down to 0.
mem_cgroup_uncharge_swap()
Called when swp_entry's refcnt goes down to 0. A charge against swap
disappears.
3. charge-commit-cancel
=======================
Memcg pages are charged in two steps:
- mem_cgroup_try_charge()
- mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
At try_charge(), there are no flags to say "this page is charged".
At this point, usage += PAGE_SIZE.
At commit(), the page is associated with the memcg.
At cancel(), simply usage -= PAGE_SIZE.
In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
4. Anonymous
============
Anonymous page is newly allocated at
- page fault into MAP_ANONYMOUS mapping.
- Copy-On-Write.
4.1 Swap-in.
At swap-in, the page is taken from swap-cache. There are 2 cases.
(a) If the SwapCache is newly allocated and read, it has no charges.
(b) If the SwapCache has been mapped by processes, it has been
charged already.
4.2 Swap-out.
At swap-out, typical state transition is below.
(a) add to swap cache. (marked as SwapCache)
swp_entry's refcnt += 1.
(b) fully unmapped.
swp_entry's refcnt += # of ptes.
(c) write back to swap.
(d) delete from swap cache. (remove from SwapCache)
swp_entry's refcnt -= 1.
Finally, at task exit,
(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
5. Page Cache
=============
Page Cache is charged at
- add_to_page_cache_locked().
The logic is very clear. (About migration, see below)
Note:
__remove_from_page_cache() is called by remove_from_page_cache()
and __remove_mapping().
6. Shmem(tmpfs) Page Cache
===========================
The best way to understand shmem's page state transition is to read
mm/shmem.c.
But brief explanation of the behavior of memcg around shmem will be
helpful to understand the logic.
Shmem's page (just leaf page, not direct/indirect block) can be on
- radix-tree of shmem's inode.
- SwapCache.
- Both on radix-tree and SwapCache. This happens at swap-in
and swap-out,
It's charged when...
- A new page is added to shmem's radix-tree.
- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
7. Page Migration
=================
mem_cgroup_migrate()
8. LRU
======
Each memcg has its own private LRU. Currently, its handling is under the
global VM's control (meaning it's handled under the global pgdat->lru_lock).
Almost all routines around memcg's LRU are called by the global LRU's
list management functions under pgdat->lru_lock.

A special function is mem_cgroup_isolate_pages(). This scans the
memcg's private LRU and calls __isolate_lru_page() to extract a page
from the LRU.
(By __isolate_lru_page(), the page is removed from both the global and
the private LRU.)
9. Typical Tests.
=================
Tests for racy cases.
9.1 Small limit to memcg.
-------------------------
When testing racy cases, it's a good idea to set memcg's limit very
small, rather than in GB. Many races were found in tests under xKB or
xxMB limits.
(Memory behavior under a GB-sized limit and under an MB-sized limit
shows very different situations.)
9.2 Shmem
---------
Historically, memcg's shmem handling was poor and we saw some troubles
here. This is because shmem is page cache but can also be SwapCache.
Testing with shmem/tmpfs is always a good test.
9.3 Migration
-------------
For NUMA, migration is another special case. For an easy test, cpuset
is useful. The following is a sample script to do migration::
mount -t cgroup -o cpuset none /opt/cpuset
mkdir /opt/cpuset/01
echo 1 > /opt/cpuset/01/cpuset.cpus
echo 0 > /opt/cpuset/01/cpuset.mems
echo 1 > /opt/cpuset/01/cpuset.memory_migrate
mkdir /opt/cpuset/02
echo 1 > /opt/cpuset/02/cpuset.cpus
echo 1 > /opt/cpuset/02/cpuset.mems
echo 1 > /opt/cpuset/02/cpuset.memory_migrate
In the above setup, when you move a task from 01 to 02, page migration
from node 0 to node 1 will occur. The following is a script to migrate
all tasks under a cpuset::
--
move_task()
{
for pid in $1
do
/bin/echo $pid >$2/tasks 2>/dev/null
echo -n $pid
echo -n " "
done
echo END
}
G1_TASK=`cat ${G1}/tasks`
G2_TASK=`cat ${G2}/tasks`
move_task "${G1_TASK}" ${G2} &
--
9.4 Memory hotplug
------------------
The memory hotplug test is another good test.

To offline memory, do the following::
# echo offline > /sys/devices/system/memory/memoryXXX/state
(where XXX is the number of the memory block)
This is an easy way to test page migration, too.
9.5 mkdir/rmdir
---------------
When using hierarchy, mkdir/rmdir test should be done.
Use tests like the following::
echo 1 >/opt/cgroup/01/memory/use_hierarchy
mkdir /opt/cgroup/01/child_a
mkdir /opt/cgroup/01/child_b
set limit to 01.
add limit to 01/child_b
run jobs under child_a and child_b
create/delete following groups at random while jobs are running::
/opt/cgroup/01/child_a/child_aa
/opt/cgroup/01/child_b/child_bb
/opt/cgroup/01/child_c
Running new jobs in a new group is also good.
9.6 Mount with other subsystems
-------------------------------
Mounting with other subsystems is a good test because there are races
and lock dependencies with other cgroup subsystems.
example::
# mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
and do task move, mkdir, rmdir etc...under this.
9.7 swapoff
-----------
Besides the fact that swap management is one of the complicated parts of
memcg, the call path of swap-in at swapoff is not the same as the usual
swap-in path. It's worth testing explicitly.

For example, a test like the following is good:
(Shell-A)::
# mount -t cgroup none /cgroup -o memory
# mkdir /cgroup/test
# echo 40M > /cgroup/test/memory.limit_in_bytes
# echo 0 > /cgroup/test/tasks
Run a malloc(100M) program under this. You'll see 60M of swap used.
(Shell-B)::
# move all tasks in /cgroup/test to /cgroup
# /sbin/swapoff -a
# rmdir /cgroup/test
# kill malloc task.
Of course, the tmpfs vs. swapoff case should be tested, too.
9.8 OOM-Killer
--------------
Out-of-memory caused by memcg's limit will kill tasks under
the memcg. When hierarchy is used, a task under hierarchy
will be killed by the kernel.
In this case, panic_on_oom shouldn't be invoked and tasks
in other groups shouldn't be killed.
It's not difficult to cause OOM under memcg, as follows.
Case A) when you can swapoff::
#swapoff -a
#echo 50M > /memory.limit_in_bytes
run 51M of malloc
Case B) when you use mem+swap limitation::
#echo 50M > memory.limit_in_bytes
#echo 50M > memory.memsw.limit_in_bytes
run 51M of malloc
9.9 Move charges at task migration
----------------------------------
Charges associated with a task can be moved along with task migration.
(Shell-A)::
#mkdir /cgroup/A
#echo $$ >/cgroup/A/tasks
run some programs which uses some amount of memory in /cgroup/A.
(Shell-B)::
#mkdir /cgroup/B
#echo 1 >/cgroup/B/memory.move_charge_at_immigrate
#echo "pid of the program running in group A" >/cgroup/B/tasks
You can see charges have been moved by reading ``*.usage_in_bytes`` or
memory.stat of both A and B.
See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst to see what value should
be written to move_charge_at_immigrate.
9.10 Memory thresholds
----------------------
The memory controller implements memory thresholds using the cgroups
notification API. You can use tools/cgroup/cgroup_event_listener.c to test it.
(Shell-A) Create cgroup and run event listener::
# mkdir /cgroup/A
# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
(Shell-B) Add task to cgroup and try to allocate and free memory::
# echo $$ >/cgroup/A/tasks
# a="$(dd if=/dev/zero bs=1M count=10)"
# a=
You will see message from cgroup_event_listener every time you cross
the thresholds.
Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.
It's a good idea to test the root cgroup as well.
File diff suppressed because it is too large

View File
@@ -0,0 +1,44 @@
=========================
Network classifier cgroup
=========================
The Network classifier cgroup provides an interface to
tag network packets with a class identifier (classid).
The Traffic Controller (tc) can be used to assign
different priorities to packets from different cgroups.
Also, Netfilter (iptables) can use this tag to perform
actions on such packets.
Creating a net_cls cgroups instance creates a net_cls.classid file.
This net_cls.classid value is initialized to 0.
You can write hexadecimal values to net_cls.classid; the format for these
values is 0xAAAABBBB; AAAA is the major handle number and BBBB
is the minor handle number.
Reading net_cls.classid yields a decimal result.
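For instance, the 10:1 handle used below is 0xAAAABBBB with AAAA = 0x0010
and BBBB = 0x0001, i.e. 0x100001, which reads back as decimal 1048577.
A quick shell sanity check::

	printf '%d\n' 0x100001    # prints 1048577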
Example::
mkdir /sys/fs/cgroup/net_cls
mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
mkdir /sys/fs/cgroup/net_cls/0
echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
- setting a 10:1 handle::
cat /sys/fs/cgroup/net_cls/0/net_cls.classid
1048577
- configuring tc::
tc qdisc add dev eth0 root handle 10: htb
tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
- creating traffic class 10:1::
tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
configuring iptables, basic example::
iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP
View File
@@ -0,0 +1,57 @@
=======================
Network priority cgroup
=======================
The Network priority cgroup provides an interface to allow an administrator to
dynamically set the priority of network traffic generated by various
applications.
Nominally, an application would set the priority of its traffic via the
SO_PRIORITY socket option. This, however, is not always possible because:
1) The application may not have been coded to set this value
2) The priority of application traffic is often a site-specific administrative
decision rather than an application defined one.
This cgroup allows an administrator to assign a process to a group which defines
the priority of egress traffic on a given interface. Network priority groups can
be created by first mounting the cgroup filesystem::
# mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
With the above step, the initial group acting as the parent accounting group
becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in
the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.
Each net_prio cgroup contains two files that are subsystem specific
net_prio.prioidx
This file is read-only, and is simply informative. It contains a unique
integer value that the kernel uses as an internal representation of this
cgroup.
net_prio.ifpriomap
This file contains a map of the priorities assigned to traffic originating
from processes in this group and egressing the system on various interfaces.
It contains a list of tuples in the form <ifname priority>. Contents of this
file can be modified by echoing a string into the file using the same tuple
format. For example::
echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap
This command would force any traffic originating from processes belonging to the
iscsi net_prio cgroup and egressing on interface eth0 to have the priority of
said traffic set to the value 5. The parent accounting group also has a
writeable 'net_prio.ifpriomap' file that can be used to set a system default
priority.
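Reading the file back shows the whole per-interface map. A sketch, reusing
the "iscsi" group from the example above (the interface list naturally
differs per system)::

	# cat /sys/fs/cgroup/net_prio/iscsi/net_prio.ifpriomap
	lo 0
	eth0 5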
Priorities are set immediately prior to queueing a frame to the device
queueing discipline (qdisc) so priorities will be assigned prior to the hardware
queue selection being made.
One usage for the net_prio cgroup is with mqprio qdisc allowing application
traffic to be steered to hardware/driver based traffic classes. These mappings
can then be managed by administrators or other networking protocols such as
DCBX.
A new net_prio cgroup inherits the parent's configuration.
View File
@@ -0,0 +1,92 @@
=========================
Process Number Controller
=========================
Abstract
--------
The process number controller is used to allow a cgroup hierarchy to stop any
new tasks from being fork()'d or clone()'d after a certain limit is reached.
Since it is trivial to hit the task limit without hitting any kmemcg limits in
place, PIDs are a fundamental resource. As such, PID exhaustion must be
preventable in the scope of a cgroup hierarchy by allowing resource limiting of
the number of tasks in a cgroup.
Usage
-----
In order to use the `pids` controller, set the maximum number of tasks in
pids.max (this is not available in the root cgroup for obvious reasons). The
number of processes currently in the cgroup is given by pids.current.
Organisational operations are not blocked by cgroup policies, so it is possible
to have pids.current > pids.max. This can be done by either setting the limit to
be smaller than pids.current, or attaching enough processes to the cgroup such
that pids.current > pids.max. However, it is not possible to violate a cgroup
policy through fork() or clone(). fork() and clone() will return -EAGAIN if the
creation of a new process would cause a cgroup policy to be violated.
To set a cgroup to have no limit, set pids.max to "max". This is the default for
all new cgroups (N.B. that PID limits are hierarchical, so the most stringent
limit in the hierarchy is followed).
pids.current tracks all child cgroup hierarchies, so parent/pids.current is a
superset of parent/child/pids.current.
The pids.events file contains event counters:
- max: Number of times fork failed because limit was hit.
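For instance, once a cgroup exists (such as the "parent" cgroup created in
the example below), the counter can be read directly; zero means the limit
has never been hit::

	# cat /sys/fs/cgroup/pids/parent/pids.events
	max 0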
Example
-------
First, we mount the pids controller::
# mkdir -p /sys/fs/cgroup/pids
# mount -t cgroup -o pids none /sys/fs/cgroup/pids
Then we create a hierarchy, set limits and attach processes to it::
# mkdir -p /sys/fs/cgroup/pids/parent/child
# echo 2 > /sys/fs/cgroup/pids/parent/pids.max
# echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs
# cat /sys/fs/cgroup/pids/parent/pids.current
2
#
It should be noted that attempts to overcome the set limit (2 in this case) will
fail::
# cat /sys/fs/cgroup/pids/parent/pids.current
2
# ( /bin/echo "Here's some processes for you." | cat )
sh: fork: Resource temporarily unavailable
#
Even if we migrate to a child cgroup (which doesn't have a set limit), we will
not be able to overcome the most stringent limit in the hierarchy (in this case,
parent's)::
# echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs
# cat /sys/fs/cgroup/pids/parent/pids.current
2
# cat /sys/fs/cgroup/pids/parent/child/pids.current
2
# cat /sys/fs/cgroup/pids/parent/child/pids.max
max
# ( /bin/echo "Here's some processes for you." | cat )
sh: fork: Resource temporarily unavailable
#
We can set a limit that is smaller than pids.current, which will stop any new
processes from being forked at all (note that the shell itself counts towards
pids.current)::
# echo 1 > /sys/fs/cgroup/pids/parent/pids.max
# /bin/echo "We can't even spawn a single process now."
sh: fork: Resource temporarily unavailable
# echo 0 > /sys/fs/cgroup/pids/parent/pids.max
# /bin/echo "We can't even spawn a single process now."
sh: fork: Resource temporarily unavailable
#
View File
@@ -0,0 +1,117 @@
===============
RDMA Controller
===============
.. Contents
1. Overview
1-1. What is RDMA controller?
1-2. Why RDMA controller needed?
1-3. How is RDMA controller implemented?
2. Usage Examples
1. Overview
===========
1-1. What is RDMA controller?
-----------------------------
The RDMA controller allows users to limit the RDMA/IB-specific resources
that a given set of processes can use. These processes are grouped using
the RDMA controller.

The RDMA controller defines two resources which can be limited for
processes of a cgroup.
1-2. Why RDMA controller needed?
--------------------------------
Currently user space applications can easily take away all the rdma verb
specific resources such as AH, CQ, QP, MR, etc., so that other applications
in other cgroups or kernel space ULPs may not even get a chance to allocate
any rdma resources. This can lead to service unavailability.

Therefore an RDMA controller is needed, through which the resource
consumption of processes can be limited and through which different rdma
resources can be accounted.
1-3. How is RDMA controller implemented?
----------------------------------------
The RDMA cgroup allows limit configuration of resources. The rdma cgroup
maintains resource accounting per cgroup, per device, using a resource pool
structure. Each such resource pool is limited to 64 resources by the rdma
cgroup, which can be extended later if required.

This resource pool object is linked to the cgroup css. Typically there
are 0 to 4 resource pool instances per cgroup, per device in most use
cases, but nothing prevents having more. At present, hundreds of RDMA
devices per single cgroup may not be handled optimally; however, there is
no known use case or requirement for such a configuration either.
Since RDMA resources can be allocated from any process and can be freed by
any of the child processes which share the address space, rdma resources
are always owned by the creator cgroup css. This allows process migration
from one cgroup to another without the major complexity of transferring
resource ownership, because such ownership is not really present due to the
shared nature of rdma resources. Linking resources to the css also ensures
that cgroups can be deleted after processes have migrated. This allows
migration with active resources as well, even though that is not a primary
use case.
Whenever RDMA resource charging occurs, the owner rdma cgroup is returned
to the caller. The same rdma cgroup should be passed while uncharging the
resource. This also allows a process migrated with active RDMA resources
to charge new resources to the new owner cgroup, and to uncharge the
resources of a process from the previously charged cgroup after it has
migrated to a new cgroup, even though that is not a primary use case.
A resource pool object is created in the following situations:

(a) The user sets a limit and no previous resource pool exists for the
    device of interest for the cgroup.
(b) No resource limits were configured, but the IB/RDMA stack tries to
    charge the resource anyway, so that it can correctly uncharge it later
    when limits are enforced during uncharging; otherwise the usage count
    would drop to a negative value.

The resource pool is destroyed if all the resource limits are set to max
and it is the last resource getting deallocated.

Users should set all limits to the max value if they intend to
remove/unconfigure the resource pool for a particular device.
The IB stack honors limits enforced by the rdma controller. When an
application queries the maximum resource limits of an IB device, it returns
the minimum of what is configured by the user for the given cgroup and what
is supported by the IB device.
The following resources can be accounted by the rdma controller.
========== =============================
hca_handle Maximum number of HCA Handles
hca_object Maximum number of HCA Objects
========== =============================
2. Usage Examples
=================
(a) Configure resource limit::
echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max
echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max
(b) Query resource limit::
cat /sys/fs/cgroup/rdma/2/rdma.max
#Output:
mlx4_0 hca_handle=2 hca_object=2000
ocrdma1 hca_handle=3 hca_object=max
(c) Query current usage::
cat /sys/fs/cgroup/rdma/2/rdma.current
#Output:
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23
(d) Delete resource limit::
echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
View File
@@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
v1 is available under Documentation/cgroup-v1/.
v1 is available under Documentation/admin-guide/cgroup-v1/.
.. CONTENTS
@@ -705,6 +705,12 @@ Conventions
informational files on the root cgroup which end up showing global
information available elsewhere shouldn't exist.
- The default time unit is microseconds. If a different unit is ever
used, an explicit unit suffix must be present.
- A parts-per quantity should use a percentage decimal with at least
two digit fractional part - e.g. 13.40.
- If a controller implements weight based resource distribution, its
interface file should be named "weight" and have the range [1,
10000] with 100 as the default. The values are chosen to allow
@@ -1008,7 +1014,7 @@ All time durations are in microseconds.
A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.txt for details.
Documentation/accounting/psi.rst for details.
Memory
@@ -1140,6 +1146,11 @@ PAGE_SIZE multiple when read back.
otherwise, a value change in this file generates a file
modified event.
Note that all fields in this file are hierarchical and the
file modified event can be generated due to an event down the
hierarchy. For the local events at the cgroup level see
memory.events.local.
low
The number of times the cgroup is reclaimed due to
high memory pressure even though its usage is under
@@ -1179,6 +1190,11 @@ PAGE_SIZE multiple when read back.
The number of processes belonging to this cgroup
killed by any kind of OOM killer.
memory.events.local
Similar to memory.events but the fields in the file are local
to the cgroup i.e. not hierarchical. The file modified event
generated on this file reflects only the local events.
memory.stat
A read-only flat-keyed file which exists on non-root cgroups.
@@ -1339,7 +1355,7 @@ PAGE_SIZE multiple when read back.
A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for memory. See
Documentation/accounting/psi.txt for details.
Documentation/accounting/psi.rst for details.
Usage Guidelines
@@ -1482,7 +1498,7 @@ IO Interface Files
A read-only nested-key file which exists on non-root cgroups.
Shows pressure stall information for IO. See
Documentation/accounting/psi.txt for details.
Documentation/accounting/psi.rst for details.
Writeback
@@ -2108,7 +2124,7 @@ following two functions.
a queue (device) has been associated with the bio and
before submission.
wbc_account_io(@wbc, @page, @bytes)
wbc_account_cgroup_owner(@wbc, @page, @bytes)
Should be called for each data segment being written out.
While this function doesn't care exactly when it's called
during the writeback session, it's the easiest and most
View File
@@ -0,0 +1,9 @@
Clearing WARN_ONCE
------------------
WARN_ONCE / WARN_ON_ONCE / printk_once only emit a message once.
::

	echo 1 > /sys/kernel/debug/clear_warn_once

clears the state and allows the warnings to print once again.
This can be useful after test suite runs to reproduce problems.
View File
@@ -1,10 +0,0 @@
# -*- coding: utf-8; mode: python -*-
project = 'Linux Kernel User Documentation'
tags.add("subproject")
latex_documents = [
('index', 'linux-user.tex', 'Linux Kernel User Documentation',
'The kernel development community', 'manual'),
]
View File
@@ -0,0 +1,114 @@
========
CPU load
========
Linux exports various bits of information via ``/proc/stat`` and
``/proc/uptime`` that userland tools, such as top(1), use to calculate
the average time system spent in a particular state, for example::
$ iostat
Linux 2.6.18.3-exp (linmac) 02/20/2007
avg-cpu: %user %nice %system %iowait %steal %idle
10.01 0.00 2.92 5.44 0.00 81.63
...
Here the system thinks that over the default sampling period the
system spent 10.01% of the time doing work in user space, 2.92% in the
kernel, and was overall 81.63% of the time idle.
In most cases the ``/proc/stat`` information reflects reality quite
closely; however, due to the nature of how/when the kernel collects
this data, sometimes it cannot be trusted at all.
So how is this information collected? Whenever a timer interrupt is
signalled, the kernel looks at what kind of task was running at that
moment and increments the counter that corresponds to that task's
kind/state. The problem with this is that the system could have
switched between various states multiple times between two timer
interrupts yet the counter is incremented only for the last state.
Example
-------
If we imagine the system with one task that periodically burns cycles
in the following manner::
time line between two timer interrupts
|--------------------------------------|
^ ^
|_ something begins working |
|_ something goes to sleep
(only to be awaken quite soon)
In the above situation the system will be 0% loaded according to
``/proc/stat`` (since the timer interrupt will always happen when the
system is executing the idle handler), but in reality the load is
closer to 99%.
One can imagine many more situations where this behavior of the kernel
will lead to quite erratic information inside ``/proc/stat``::
/* gcc -o hog smallhog.c */
#include <time.h>
#include <limits.h>
#include <signal.h>
#include <sys/time.h>
#define HIST 10
static volatile sig_atomic_t stop;
static void sighandler (int signr)
{
(void) signr;
stop = 1;
}
static unsigned long hog (unsigned long niters)
{
stop = 0;
while (!stop && --niters);
return niters;
}
int main (void)
{
int i;
struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
.it_value = { .tv_sec = 0, .tv_usec = 1 } };
sigset_t set;
unsigned long v[HIST];
double tmp = 0.0;
unsigned long n;
signal (SIGALRM, &sighandler);
setitimer (ITIMER_REAL, &it, NULL);
hog (ULONG_MAX);
for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
for (i = 0; i < HIST; ++i) tmp += v[i];
tmp /= HIST;
n = tmp - (tmp / 3.0);
sigemptyset (&set);
sigaddset (&set, SIGALRM);
for (;;) {
hog (n);
sigwait (&set, &i);
}
return 0;
}
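To reproduce the effect, one can compile and run the hog and then compare
what the accounting claims with the fact that the program burns most of
each timer interval (a sketch; exact numbers vary by system)::

	$ gcc -o hog smallhog.c
	$ ./hog &
	$ iostat    # or top(1): the reported idle time stays implausibly high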
References
----------
- http://lkml.org/lkml/2007/2/12/6
- Documentation/filesystems/proc.txt (1.8)
Thanks
------
Con Kolivas, Pavel Machek
View File
@@ -0,0 +1,177 @@
===========================================
How CPU topology info is exported via sysfs
===========================================
Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo output of some architectures. They reside in
/sys/devices/system/cpu/cpuX/topology/:
physical_package_id:
physical package id of cpuX. Typically corresponds to a physical
socket number, but the actual value is architecture and platform
dependent.
die_id:
the CPU die ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
core_id:
the CPU core ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
book_id:
the book ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
drawer_id:
the drawer ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
core_cpus:
internal kernel map of CPUs within the same core.
(deprecated name: "thread_siblings")
core_cpus_list:
human-readable list of CPUs within the same core.
(deprecated name: "thread_siblings_list")
package_cpus:
internal kernel map of the CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings")
package_cpus_list:
human-readable list of CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings_list")
die_cpus:
internal kernel map of CPUs within the same die.
die_cpus_list:
human-readable list of CPUs within the same die.
book_siblings:
internal kernel map of cpuX's hardware threads within the same
book_id.
book_siblings_list:
human-readable list of cpuX's hardware threads within the same
book_id.
drawer_siblings:
internal kernel map of cpuX's hardware threads within the same
drawer_id.
drawer_siblings_list:
human-readable list of cpuX's hardware threads within the same
drawer_id.
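For example, on a hypothetical two-way SMT x86 system the attributes might
read as follows (all values are machine dependent)::

	$ cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
	0
	$ cat /sys/devices/system/cpu/cpu0/topology/core_cpus_list
	0,4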
The architecture-neutral code in drivers/base/topology.c exports these attributes.
However, the book and drawer related sysfs files will only be created if
CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are selected, respectively.
CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390,
where they reflect the cpu and cache hierarchy.
For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h::
#define topology_physical_package_id(cpu)
#define topology_die_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
#define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
The type of ``**_id macros`` is int.
The type of ``**_cpumask macros`` is ``(const) struct cpumask *``. The latter
correspond with appropriate ``**_siblings`` sysfs attributes (except for
topology_sibling_cpumask() which corresponds with thread_siblings).
To be consistent on all architectures, include/linux/topology.h
provides default definitions for any of the above macros that are
not defined by include/asm-XXX/topology.h:
1) topology_physical_package_id: -1
2) topology_die_id: -1
3) topology_core_id: 0
4) topology_sibling_cpumask: just the given CPU
5) topology_core_cpumask: just the given CPU
6) topology_die_cpumask: just the given CPU
For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
default definitions for topology_book_id() and topology_book_cpumask().
For architectures that don't support drawers (CONFIG_SCHED_DRAWER) there are
no default definitions for topology_drawer_id() and topology_drawer_cpumask().
Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal
source for the output is in brackets ("[]").
=========== ==========================================================
kernel_max: the maximum CPU index allowed by the kernel configuration.
[NR_CPUS-1]
offline: CPUs that are not online because they have been
HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
of CPUs allowed by the kernel configuration (kernel_max
above). [~cpu_online_mask + cpus >= NR_CPUS]
online: CPUs that are online and being scheduled [cpu_online_mask]
possible: CPUs that have been allocated resources and can be
brought online if they are present. [cpu_possible_mask]
present: CPUs that have been identified as being present in the
system. [cpu_present_mask]
=========== ==========================================================
The format for the above output is compatible with cpulist_parse()
[see <linux/cpumask.h>]. Some examples follow.
In this example, there are 64 CPUs in the system but cpus 32-63 exceed
the kernel max which is limited to 0..31 by the NR_CPUS config option
being 32. Note also that CPUs 2 and 4-31 are not online but could be
brought online as they are both present and possible::
kernel_max: 31
offline: 2,4-31,32-63
online: 0-1,3
possible: 0-31
present: 0-31
In this example, the NR_CPUS config option is 128, but the kernel was
started with possible_cpus=144. There are 4 CPUs in the system and cpu2
was manually taken offline (and is the only CPU that can be brought
online.)::
kernel_max: 127
offline: 2,4-127,128-143
online: 0-1,3
possible: 0-127
present: 0-3
See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter
as well as more information on the various cpumasks.
View File
@@ -0,0 +1,131 @@
=============================
Guidance for writing policies
=============================
Try to keep transactionality out of it. The core is careful to
avoid asking about anything that is migrating. This is a pain, but
makes it easier to write the policies.
Mappings are loaded into the policy at construction time.
Every bio that is mapped by the target is referred to the policy.
The policy can return a simple HIT or MISS or issue a migration.
Currently there's no way for the policy to issue background work,
e.g. to start writing back dirty blocks that are going to be evicted
soon.
Because we map bios, rather than requests, it's easy for the policy
to get fooled by many small bios. For this reason the core target
issues periodic ticks to the policy. It's suggested that the policy
doesn't update states (eg, hit counts) for a block more than once
for each tick. The core ticks by watching bios complete, and so
tries to see when the io scheduler has let the ios run.
Overview of supplied cache replacement policies
===============================================
multiqueue (mq)
---------------
This policy is now an alias for smq (see below).
The following tunables are accepted, but have no effect::
'sequential_threshold <#nr_sequential_ios>'
'random_threshold <#nr_random_ios>'
'read_promote_adjustment <value>'
'write_promote_adjustment <value>'
'discard_promote_adjustment <value>'
Stochastic multiqueue (smq)
---------------------------
This policy is the default.
The stochastic multi-queue (smq) policy addresses some of the problems
with the multiqueue (mq) policy.
The smq policy (vs mq) offers the promise of less memory utilization,
improved performance and increased adaptability in the face of changing
workloads. smq also does not have any cumbersome tuning knobs.
Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target. Doing so will cause all of the
mq policy's hints to be dropped. Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots
that should be cached.
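A minimal sketch of such a reload (the device name, table geometry and
backing devices here are illustrative only)::
dmsetup suspend my_cache
dmsetup load my_cache --table "0 41943040 cache /dev/mapper/metadata \
/dev/mapper/ssd /dev/mapper/origin 512 0 smq 0"
dmsetup resume my_cache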
Memory usage
^^^^^^^^^^^^
The mq policy used a lot of memory; 88 bytes per cache block on a 64-bit
machine.
smq uses 28-bit indexes to implement its data structures rather than
pointers. It avoids storing an explicit hit count for each block. It
has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).
All this means smq uses ~25 bytes per cache block. Still a lot of
memory, but a substantial improvement nonetheless.
Level balancing
^^^^^^^^^^^^^^^
mq placed entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)). This meant the bottom
levels generally had the most entries, and the top ones had very
few. Having unbalanced levels like this reduced the efficacy of the
multiqueue.
smq does not maintain a hit count; instead it swaps hit entries with
the least recently used entry from the level above. The overall
ordering is a side effect of this stochastic process. With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.
Adaptability
^^^^^^^^^^^^
The mq policy maintained a hit count for each cache block. For a
different block to get promoted to the cache its hit count has to
exceed the lowest currently in the cache. This meant it could take a
long time for the cache to adapt between varying IO patterns.
smq doesn't maintain hit counts, so a lot of this problem just goes
away. In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote. If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels. This lets it adapt to new IO patterns very quickly.
Performance
^^^^^^^^^^^
Testing smq shows substantially better performance than mq.
cleaner
-------
The cleaner writes back all dirty blocks in a cache to decommission it.
Examples
========
The syntax for a table is::
cache <metadata dev> <cache dev> <origin dev> <block size>
<#feature_args> [<feature arg>]*
<policy> <#policy_args> [<policy arg>]*
The syntax to send a message using the dmsetup command is::
dmsetup message <mapped device> 0 sequential_threshold 1024
dmsetup message <mapped device> 0 random_threshold 8
Using dmsetup::
dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
/dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"
creates a 128GB mapped device named 'blah' with the sequential
threshold set to 1024 and the random threshold set to 8.

View File

@@ -0,0 +1,337 @@
=====
Cache
=====
Introduction
============
dm-cache is a device mapper target written by Joe Thornber, Heinz
Mauelshagen, and Mike Snitzer.
It aims to improve performance of a block device (eg, a spindle) by
dynamically migrating some of its data to a faster, smaller device
(eg, an SSD).
This device-mapper solution allows us to insert this caching at
different levels of the dm stack, for instance above the data device for
a thin-provisioning pool. Caching solutions that are integrated more
closely with the virtual memory system should give better performance.
The target reuses the metadata library used in the thin-provisioning
library.
The decision as to what data to migrate and when is left to a plug-in
policy module. Several of these have been written as we experiment,
and we hope other people will contribute others for specific io
scenarios (eg. a vm image server).
Glossary
========
Migration
Movement of the primary copy of a logical block from one
device to the other.
Promotion
Migration from slow device to fast device.
Demotion
Migration from fast device to slow device.
The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device
(depending on policy).
Design
======
Sub-devices
-----------
The target is constructed by passing three devices to it (along with
other parameters detailed later):
1. An origin device - the big, slow one.
2. A cache device - the small, fast one.
3. A small metadata device - records which blocks are in the cache,
which are dirty, and extra hints for use by the policy object.
This information could be put on the cache device, but having it
separate allows the volume manager to configure it differently,
e.g. as a mirror for extra robustness. This metadata device may only
be used by a single cache device.
Fixed block size
----------------
The origin is divided up into blocks of a fixed size. This block size
is configurable when you first create the cache. Typically we've been
using block sizes of 256KB - 1024KB. The block size must be between 64
sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
Having a fixed block size simplifies the target a lot. But it is
something of a compromise. For instance, a small part of a block may be
getting hit a lot, yet the whole block will be promoted to the cache.
So large block sizes are bad because they waste cache space. And small
block sizes are bad because they increase the amount of metadata (both
in core and on disk).
Cache operating modes
---------------------
The cache has three operating modes: writeback, writethrough and
passthrough.
If writeback, the default, is selected then a write to a block that is
cached will go only to the cache and the block will be marked dirty in
the metadata.
If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices. Clean
blocks should remain clean.
If passthrough is selected, useful when the cache contents are not known
to be coherent with the origin device, then all reads are served from
the origin device (all reads miss the cache) and all writes are
forwarded to the origin device; additionally, write hits cause cache
block invalidates. To enable passthrough mode the cache must be clean.
Passthrough mode allows a cache device to be activated without having to
worry about coherency. Coherency that exists is maintained, although
the cache will gradually cool as writes take place. If the coherency of
the cache can later be verified, or established through use of the
"invalidate_cblocks" message, the cache device can be transitioned to
writethrough or writeback mode while still warm. Otherwise, the cache
contents can be discarded prior to transitioning to the desired
operating mode.
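A hedged sketch of that workflow (device names and sizes illustrative):
create the cache in passthrough mode, verify or invalidate its contents,
then reload the table without the passthrough feature::
dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
/dev/mapper/ssd /dev/mapper/origin 512 1 passthrough default 0"
# ... verify coherency, or use the invalidate_cblocks message ...
dmsetup suspend my_cache
dmsetup load my_cache --table "0 41943040 cache /dev/mapper/metadata \
/dev/mapper/ssd /dev/mapper/origin 512 0 default 0"
dmsetup resume my_cache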
A simple cleaner policy is provided, which will clean (write back) all
dirty blocks in a cache. Useful for decommissioning a cache or when
shrinking a cache. Shrinking the cache's fast device requires all cache
blocks, in the area of the cache being removed, to be clean. If the
area being removed from the cache still contains dirty blocks the resize
will fail. Care must be taken to never reduce the volume used for the
cache's fast device until the cache is clean. This is of particular
importance if writeback mode is used. Writethrough and passthrough
modes already maintain a clean cache. Future support to partially clean
the cache, above a specified threshold, will allow for keeping the cache
warm and in writeback mode during resize.
Migration throttling
--------------------
Migrating data between the origin and cache device uses bandwidth.
The user can set a throttle to prevent more than a certain amount of
migration occurring at any one time. Currently we're not taking any
account of normal io traffic going to the devices. More work needs
doing here to avoid migrating during those peak io moments.
For the time being, a message "migration_threshold <#sectors>"
can be used to set the maximum number of sectors being migrated,
the default being 2048 sectors (1MB).
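For example, halving the default throttle on a live device (the device
name is illustrative)::
dmsetup message my_cache 0 migration_threshold 1024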
Updating on-disk metadata
-------------------------
On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second. This
means the cache behaves like a physical disk that has a volatile write
cache. If power is lost you may lose some recent writes. The metadata
should always be consistent in spite of any crash.
The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly. So we treat it as a hint. In normal
operation it will be written when the dm device is suspended. If the
system crashes all cache blocks will be assumed dirty when restarted.
Per-block policy hints
----------------------
Policy plug-ins can store a chunk of data per cache block. It's up to
the policy how big this chunk is, but it should be kept small. Like the
dirty flags this data is lost if there's a crash so a safe fallback
value should always be possible.
Policy hints affect performance, not correctness.
Policy messaging
----------------
Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper
messages are used. Refer to cache-policies.rst.
Discard bitset resolution
-------------------------
We can avoid copying data during migration if we know the block has
been discarded. A prime example of this is when mkfs discards the
whole block device. We store a bitset tracking the discard state of
blocks. However, we allow this bitset to have a different block size
from the cache blocks. This is because we need to track the discard
state for all of the origin device (compare with the dirty bitset
which is just for the smaller cache device).
Target interface
================
Constructor
-----------
::
cache <metadata dev> <cache dev> <origin dev> <block size>
<#feature args> [<feature arg>]*
<policy> <#policy args> [policy args]*
================ =======================================================
metadata dev fast device holding the persistent metadata
cache dev fast device holding cached data blocks
origin dev slow device holding original data blocks
block size cache unit size in sectors
#feature args number of feature arguments passed
feature args writethrough or passthrough (The default is writeback.)
policy the replacement policy to use
#policy args an even number of arguments corresponding to
key/value pairs passed to the policy
policy args key/value pairs passed to the policy
E.g. 'sequential_threshold 1024'
See cache-policies.rst for details.
================ =======================================================
Optional feature arguments are:
==================== ========================================================
writethrough write through caching that prohibits cache block
content from being different from origin block content.
Without this argument, the default behaviour is to write
back cache block contents later for performance reasons,
so they may differ from the corresponding origin blocks.
passthrough a degraded mode useful for various cache coherency
situations (e.g., rolling back snapshots of
underlying storage). Reads and writes always go to
the origin. If a write goes to a cached origin
block, then the cache block is invalidated.
To enable passthrough mode the cache must be clean.
metadata2 use version 2 of the metadata. This stores the dirty
bits in a separate btree, which improves speed of
shutting down the cache.
no_discard_passdown disable passing down discards from the cache
to the origin's data device.
==================== ========================================================
A policy called 'default' is always registered. This is an alias for
the policy we currently think is giving best all round performance.
As the default policy could vary between kernels, if you are relying on
the characteristics of a specific policy, always request it by name.
Status
------
::
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
<#read hits> <#read misses> <#write hits> <#write misses>
<#demotions> <#promotions> <#dirty> <#features> <features>*
<#core args> <core args>* <policy name> <#policy args> <policy args>*
<cache metadata mode>
========================= =====================================================
metadata block size Fixed block size for each metadata block in
sectors
#used metadata blocks Number of metadata blocks used
#total metadata blocks Total number of metadata blocks
cache block size Configurable block size for the cache device
in sectors
#used cache blocks Number of blocks resident in the cache
#total cache blocks Total number of cache blocks
#read hits Number of times a READ bio has been mapped
to the cache
#read misses Number of times a READ bio has been mapped
to the origin
#write hits Number of times a WRITE bio has been mapped
to the cache
#write misses Number of times a WRITE bio has been
mapped to the origin
#demotions Number of times a block has been removed
from the cache
#promotions Number of times a block has been moved to
the cache
#dirty Number of blocks in the cache that differ
from the origin
#feature args Number of feature args to follow
feature args 'writethrough' (optional)
#core args Number of core arguments (must be even)
core args Key/value pairs for tuning the core
e.g. migration_threshold
policy name Name of the policy
#policy args Number of policy arguments to follow (must be even)
policy args Key/value pairs e.g. sequential_threshold
cache metadata mode ro if read-only, rw if read-write
In serious cases where even a read-only mode is
deemed unsafe no further I/O will be permitted and
the status will just contain the string 'Fail'.
The userspace recovery tools should then be used.
needs_check 'needs_check' if set, '-' if not set
A metadata operation has failed, resulting in the
needs_check flag being set in the metadata's
superblock. The metadata device must be
deactivated and checked/repaired before the
cache can be made fully operational again.
'-' indicates needs_check is not set.
========================= =====================================================
Messages
--------
Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these. Device-mapper
messages are used. (A sysfs interface would also be possible.)
The message format is::
<key> <value>
E.g.::
dmsetup message my_cache 0 sequential_threshold 1024
Invalidation is removing an entry from the cache without writing it
back. Cache blocks can be invalidated via the invalidate_cblocks
message, which takes an arbitrary number of cblock ranges. Each cblock
range's end value is "one past the end", meaning 5-10 expresses a range
of values from 5 to 9. Each cblock must be expressed as a decimal
value, in the future a variant message that takes cblock ranges
expressed in hexadecimal may be needed to better support efficient
invalidation of larger caches. The cache must be in passthrough mode
when invalidate_cblocks is used::
invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
E.g.::
dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
Examples
========
The test suite can be found here:
https://github.com/jthornber/device-mapper-test-suite
::
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
mq 4 sequential_threshold 1024 random_threshold 8'

View File

@@ -0,0 +1,31 @@
========
dm-delay
========
Device-Mapper's "delay" target delays reads and/or writes
and maps them to different devices.
Parameters::
<device> <offset> <delay> [<write_device> <write_offset> <write_delay>
[<flush_device> <flush_offset> <flush_delay>]]
With separate write parameters, the first set is only used for reads.
Offsets are specified in sectors.
Delays are specified in milliseconds.
Example scripts
===============
::
#!/bin/sh
# Create device delaying rw operation for 500ms
echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed
::
#!/bin/sh
# Create device delaying only write operation for 500ms and
# splitting reads and writes to different devices $1 $2
echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
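A third variant, sketched under the assumption that the flush triple
behaves as described above, delays only flush requests::
#!/bin/sh
# Pass reads and writes through undelayed, but delay flushes by 500ms
echo "0 `blockdev --getsz $1` delay $1 0 0 $1 0 0 $1 0 500" | dmsetup create delayed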

View File

@@ -0,0 +1,173 @@
========
dm-crypt
========
Device-Mapper's "crypt" target provides transparent encryption of block devices
using the kernel crypto API.
For a more detailed description of supported parameters see:
https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt
Parameters::
<cipher> <key> <iv_offset> <device path> \
<offset> [<#opt_params> <opt_params>]
<cipher>
Encryption cipher, encryption mode and Initial Vector (IV) generator.
The cipher specifications format is::
cipher[:keycount]-chainmode-ivmode[:ivopts]
Examples::
aes-cbc-essiv:sha256
aes-xts-plain64
serpent-xts-plain64
Cipher format also supports direct specification with kernel crypt API
format (selected by capi: prefix). The IV specification is the same
as for the first format type.
This format is mainly used for specification of authenticated modes.
The crypto API cipher specifications format is::
capi:cipher_api_spec-ivmode[:ivopts]
Examples::
capi:cbc(aes)-essiv:sha256
capi:xts(aes)-plain64
Examples of authenticated modes::
capi:gcm(aes)-random
capi:authenc(hmac(sha256),xts(aes))-random
capi:rfc7539(chacha20,poly1305)-random
/proc/crypto contains a list of currently loaded crypto modes.
<key>
Key used for encryption. It is encoded either as a hexadecimal number
or it can be passed as <key_string> prefixed with single colon
character (':') for keys residing in kernel keyring service.
You can only use key sizes that are valid for the selected cipher
in combination with the selected iv mode.
Note that for some iv modes the key string can contain additional
keys (for example IV seed) so the key contains more parts concatenated
into a single string.
<key_string>
The kernel keyring key is identified by string in following format:
<key_size>:<key_type>:<key_description>.
<key_size>
The encryption key size in bytes. The kernel key payload size must match
the value passed in <key_size>.
<key_type>
Either 'logon' or 'user' kernel key type.
<key_description>
The kernel keyring key description crypt target should look for
when loading key of <key_type>.
<keycount>
Multi-key compatibility mode. You can define <keycount> keys and
then sectors are encrypted according to their offsets (sector 0 uses key0;
sector 1 uses key1 etc.). <keycount> must be a power of two.
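A purely illustrative multi-key specification with 64 keys, as used for
loop-AES compatibility, might look like::
aes:64-cbc-lmk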
<iv_offset>
The IV offset is a sector count that is added to the sector number
before creating the IV.
<device path>
This is the device that is going to be used as backend and contains the
encrypted data. You can specify it as a path like /dev/xxx or a device
number <major>:<minor>.
<offset>
Starting sector within the device where the encrypted data begins.
<#opt_params>
Number of optional parameters. If there are no optional parameters,
the optional parameters section can be skipped or #opt_params can be zero.
Otherwise #opt_params is the number of following arguments.
Example of optional parameters section:
3 allow_discards same_cpu_crypt submit_from_crypt_cpus
allow_discards
Block discard requests (a.k.a. TRIM) are passed through the crypt device.
The default is to ignore discard requests.
WARNING: Assess the specific security risks carefully before enabling this
option. For example, allowing discards on encrypted devices may lead to
the leak of information about the ciphertext device (filesystem type,
used space etc.) if the discarded blocks can be located easily on the
device later.
same_cpu_crypt
Perform encryption using the same cpu that IO was submitted on.
The default is to use an unbound workqueue so that encryption work
is automatically balanced between available CPUs.
submit_from_crypt_cpus
Disable offloading writes to a separate thread after encryption.
There are some situations where offloading write bios from the
encryption threads to a single thread degrades performance
significantly. The default is to offload write bios to the same
thread because it benefits CFQ to have writes submitted using the
same context.
integrity:<bytes>:<type>
The device requires additional <bytes> metadata per-sector stored
in a per-bio integrity structure. This metadata must be provided
by an underlying dm-integrity target.
The <type> can be "none" if metadata is used only for persistent IV.
For Authenticated Encryption with Additional Data (AEAD)
the <type> is "aead". An AEAD mode additionally calculates and verifies
integrity for the encrypted device. The additional space is then
used for storing authentication tag (and persistent IV if needed).
sector_size:<bytes>
Use <bytes> as the encryption unit instead of 512 bytes sectors.
This option can be in the range 512 - 4096 bytes and must be a power of two.
The virtual device will announce this size as a minimal IO and logical sector.
iv_large_sectors
IV generators will use sector number counted in <sector_size> units
instead of default 512 bytes sectors.
For example, if <sector_size> is 4096 bytes, plain64 IV for the second
sector will be 8 (without flag) and 1 if iv_large_sectors is present.
The <iv_offset> must be a multiple of <sector_size> (in 512 bytes units)
if this flag is specified.
Example scripts
===============
LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
encryption with dm-crypt using the 'cryptsetup' utility, see
https://gitlab.com/cryptsetup/cryptsetup
::
#!/bin/sh
# Create a crypt device using dmsetup
dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0"
::
#!/bin/sh
# Create a crypt device using dmsetup when encryption key is stored in keyring service
dmsetup create crypt2 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0"
::
#!/bin/sh
# Create a crypt device using cryptsetup and LUKS header with default cipher
cryptsetup luksFormat $1
cryptsetup luksOpen $1 crypt1

View File

@@ -0,0 +1,272 @@
dm-dust
=======
This target emulates the behavior of bad sectors at arbitrary
locations, and the ability to enable the emulation of the failures
at an arbitrary time.
This target behaves similarly to a linear target. At a given time,
the user can send a message to the target to start failing read
requests on specific blocks (to emulate the behavior of a hard disk
drive with bad sectors).
When the failure behavior is enabled (i.e.: when the output of
"dmsetup status" displays "fail_read_on_bad_block"), reads of blocks
in the "bad block list" will fail with EIO ("Input/output error").
Writes of blocks in the "bad block list" will result in the following:
1. Remove the block from the "bad block list".
2. Successfully complete the write.
This emulates the "remapped sector" behavior of a drive with bad
sectors.
Normally, a drive that is encountering bad sectors will most likely
encounter more bad sectors, at an unknown time or location.
With dm-dust, the user can use the "addbadblock" and "removebadblock"
messages to add arbitrary bad blocks at new locations, and the
"enable" and "disable" messages to modulate the state of whether the
configured "bad blocks" will be treated as bad, or bypassed.
This allows the pre-writing of test data and metadata prior to
simulating a "failure" event where bad sectors start to appear.
Table parameters:
-----------------
<device_path> <offset> <blksz>
Mandatory parameters:
<device_path>: path to the block device.
<offset>: offset to data area from start of device_path
<blksz>: block size in bytes
(minimum 512, maximum 1073741824, must be a power of 2)
Usage instructions:
-------------------
First, find the size (in 512-byte sectors) of the device to be used:
$ sudo blockdev --getsz /dev/vdb1
33552384
Create the dm-dust device:
(For a device with a block size of 512 bytes)
$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 512'
(For a device with a block size of 4096 bytes)
$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 4096'
Check the status of the read behavior ("bypass" indicates that all I/O
will be passed through to the underlying device):
$ sudo dmsetup status dust1
0 33552384 dust 252:17 bypass
$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct
128+0 records in
128+0 records out
$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
128+0 records in
128+0 records out
Adding and removing bad blocks:
-------------------------------
At any time (i.e.: whether the device has the "bad block" emulation
enabled or disabled), bad blocks may be added or removed from the
device via the "addbadblock" and "removebadblock" messages:
$ sudo dmsetup message dust1 0 addbadblock 60
kernel: device-mapper: dust: badblock added at block 60
$ sudo dmsetup message dust1 0 addbadblock 67
kernel: device-mapper: dust: badblock added at block 67
$ sudo dmsetup message dust1 0 addbadblock 72
kernel: device-mapper: dust: badblock added at block 72
These bad blocks will be stored in the "bad block list".
While the device is in "bypass" mode, reads and writes will succeed:
$ sudo dmsetup status dust1
0 33552384 dust 252:17 bypass
Enabling block read failures:
-----------------------------
To enable the "fail read on bad block" behavior, send the "enable" message:
$ sudo dmsetup message dust1 0 enable
kernel: device-mapper: dust: enabling read failures on bad sectors
$ sudo dmsetup status dust1
0 33552384 dust 252:17 fail_read_on_bad_block
With the device in "fail read on bad block" mode, attempting to read a
block will encounter an "Input/output error":
$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=67 iflag=direct
dd: error reading '/dev/mapper/dust1': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.00040651 s, 0.0 kB/s
...and writing to the bad blocks will remove the blocks from the list,
therefore emulating the "remap" behavior of hard disk drives:
$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
128+0 records in
128+0 records out
kernel: device-mapper: dust: block 60 removed from badblocklist by write
kernel: device-mapper: dust: block 67 removed from badblocklist by write
kernel: device-mapper: dust: block 72 removed from badblocklist by write
kernel: device-mapper: dust: block 87 removed from badblocklist by write
Bad block add/remove error handling:
------------------------------------
Attempting to add a bad block that already exists in the list will
result in an "Invalid argument" error, as well as a helpful message:
$ sudo dmsetup message dust1 0 addbadblock 88
device-mapper: message ioctl on dust1 failed: Invalid argument
kernel: device-mapper: dust: block 88 already in badblocklist
Attempting to remove a bad block that doesn't exist in the list will
result in an "Invalid argument" error, as well as a helpful message:
$ sudo dmsetup message dust1 0 removebadblock 87
device-mapper: message ioctl on dust1 failed: Invalid argument
kernel: device-mapper: dust: block 87 not found in badblocklist
Counting the number of bad blocks in the bad block list:
--------------------------------------------------------
To count the number of bad blocks configured in the device, run the
following message command:
$ sudo dmsetup message dust1 0 countbadblocks
A message will print with the number of bad blocks currently
configured on the device:
kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found
Querying for specific bad blocks:
---------------------------------
To find out if a specific block is in the bad block list, run the
following message command:
$ sudo dmsetup message dust1 0 queryblock 72
The following message will print if the block is in the list:
device-mapper: dust: queryblock: block 72 found in badblocklist
The following message will print if the block is not in the list:
device-mapper: dust: queryblock: block 72 not found in badblocklist
The "queryblock" message command will work in both the "enabled"
and "disabled" modes, allowing the verification of whether a block
will be treated as "bad" without having to issue I/O to the device,
or having to "enable" the bad block emulation.
Clearing the bad block list:
----------------------------
To clear the bad block list (without needing to individually run
a "removebadblock" message command for every block), run the
following message command:
$ sudo dmsetup message dust1 0 clearbadblocks
After clearing the bad block list, the following message will appear:
kernel: device-mapper: dust: clearbadblocks: badblocks cleared
If there were no bad blocks to clear, the following message will
appear:
kernel: device-mapper: dust: clearbadblocks: no badblocks found
Message commands list:
----------------------
Below is a list of the messages that can be sent to a dust device:
Operations on blocks (requires a <blknum> argument):
addbadblock <blknum>
queryblock <blknum>
removebadblock <blknum>
...where <blknum> is a block number within range of the device
(corresponding to the block size of the device.)
Single argument message commands:
countbadblocks
clearbadblocks
disable
enable
quiet
Device removal:
---------------
When finished, remove the device via the "dmsetup remove" command:
$ sudo dmsetup remove dust1
Quiet mode:
-----------
On test runs with many bad blocks, it may be desirable to avoid
excessive logging (from bad blocks added, removed, or "remapped").
This can be done by enabling "quiet mode" via the following message:
$ sudo dmsetup message dust1 0 quiet
This will suppress log messages from the add / remove / removed-by-write
operations. Log messages from "countbadblocks" or "queryblock"
message commands will still print in quiet mode.
The status of quiet mode can be seen by running "dmsetup status":
$ sudo dmsetup status dust1
0 33552384 dust 252:17 fail_read_on_bad_block quiet
To disable quiet mode, send the "quiet" message again:
$ sudo dmsetup message dust1 0 quiet
$ sudo dmsetup status dust1
0 33552384 dust 252:17 fail_read_on_bad_block verbose
(The presence of "verbose" indicates normal logging.)
"Why not...?"
-------------
scsi_debug has a "medium error" mode that can fail reads on one
specified sector (sector 0x1234, hardcoded in the source code), but
it uses RAM for the persistent storage, which drastically decreases
the potential device size.
dm-flakey fails all I/O from all block locations on a specified time
interval, not at a given point in time.
When a bad sector occurs on a hard disk drive, reads to that sector
are failed by the device, usually resulting in an error code of EIO
("I/O error") or ENODATA ("No data available"). However, a write to
the sector may succeed, and result in the sector becoming readable
after the device controller no longer experiences errors reading the
sector (or after a reallocation of the sector). However, there may
be bad sectors that occur on the device in the future, in a different,
unpredictable location.
This target seeks to provide a device that can exhibit the behavior
of a bad sector at a known sector location, at a known time, based
on a large storage device (at least tens of gigabytes, not occupying
system memory).

View File

@@ -0,0 +1,74 @@
=========
dm-flakey
=========
This target is the same as the linear target except that it exhibits
unreliable behaviour periodically. It's been found useful in simulating
failing devices for testing purposes.
Starting from the time the table is loaded, the device is available for
<up interval> seconds, then exhibits unreliable behaviour for <down
interval> seconds, and then this cycle repeats.
Consider using this in combination with the dm-delay target,
which can delay reads and writes and/or send them to different
underlying devices.
Table parameters
----------------
::
<dev path> <offset> <up interval> <down interval> \
[<num_features> [<feature arguments>]]
Mandatory parameters:
<dev path>:
Full pathname to the underlying block-device, or a
"major:minor" device-number.
<offset>:
Starting sector within the device.
<up interval>:
Number of seconds device is available.
<down interval>:
Number of seconds device returns errors.
Optional feature parameters:
If no feature parameters are present, during the periods of
unreliability, all I/O returns errors.
drop_writes:
All write I/O is silently ignored.
Read I/O is handled correctly.
error_writes:
All write I/O is failed with an error signalled.
Read I/O is handled correctly.
corrupt_bio_byte <Nth_byte> <direction> <value> <flags>:
During <down interval>, replace <Nth_byte> of the data of
each matching bio with <value>.
<Nth_byte>:
The offset of the byte to replace.
Counting starts at 1, to replace the first byte.
<direction>:
Either 'r' to corrupt reads or 'w' to corrupt writes.
'w' is incompatible with drop_writes.
<value>:
The value (from 0-255) to write.
<flags>:
Perform the replacement only if bio->bi_opf has all the
selected flags set.
Examples:
Replaces the 32nd byte of READ bios with the value 1::
corrupt_bio_byte 32 r 1 0
Replaces the 224th byte of REQ_META (=32) bios with the value 0::
corrupt_bio_byte 224 w 0 32
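As a hedged end-to-end sketch (the device and the timings are
illustrative): create a flakey device over $1 that is available for 59
seconds, then returns errors for 1 second, repeating::
#!/bin/sh
echo "0 `blockdev --getsz $1` flakey $1 0 59 1" | dmsetup create flaky1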

View File

@@ -0,0 +1,125 @@
================================
Early creation of mapped devices
================================
It is possible to configure a device-mapper device to act as the root device for
your system in two ways.
The first is to build an initial ramdisk which boots to a minimal userspace
which configures the device, then pivot_root(8) into it.
The second is to create one or more device-mappers using the module parameter
"dm-mod.create=" through the kernel boot command line argument.
The format is specified as a string of data separated by commas and optionally
semi-colons, where:
- a comma is used to separate fields like name, uuid, flags and table
(specifies one device)
- a semi-colon is used to separate devices.
So the format will look like this::
dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
Where::
<name> ::= The device name.
<uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
<minor> ::= The device minor number | ""
<flags> ::= "ro" | "rw"
<table> ::= <start_sector> <num_sectors> <target_type> <target_args>
<target_type> ::= "verity" | "linear" | ... (see list below)
The dm line should be equivalent to the one used by the dmsetup tool with the
`--concise` argument.
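For instance, the lroot example shown below could equivalently be created
at runtime with the following command (a hedged sketch)::
dmsetup create --concise "lroot,,,rw,0 4096 linear 98:16 0,4096 4096 linear 98:32 0"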
Target types
============
Not all target types are available as there are serious risks in allowing
activation of certain DM targets without first using userspace tools to check
the validity of associated metadata.
======================= =======================================================
`cache` constrained, userspace should verify cache device
`crypt` allowed
`delay` allowed
`era` constrained, userspace should verify metadata device
`flakey` constrained, meant for test
`linear` allowed
`log-writes` constrained, userspace should verify metadata device
`mirror` constrained, userspace should verify main/mirror device
`raid` constrained, userspace should verify metadata device
`snapshot` constrained, userspace should verify src/dst device
`snapshot-origin` allowed
`snapshot-merge` constrained, userspace should verify src/dst device
`striped` allowed
`switch` constrained, userspace should verify dev path
`thin` constrained, requires dm target message from userspace
`thin-pool` constrained, requires dm target message from userspace
`verity` allowed
`writecache` constrained, userspace should verify cache device
`zero` constrained, not meant for rootfs
======================= =======================================================
If the target is not listed above, it is constrained by default (not tested).
Examples
========
An example of booting to a linear array made up of user-mode linux block
devices::
dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
This will boot to a rw dm-linear target of 8192 sectors split across two block
devices identified by their major:minor numbers. After boot, udev will rename
this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned.
An example of multiple device-mappers, with the dm-mod.create="..." contents
is shown here split on multiple lines for readability::
dm-linear,,1,rw,
0 32768 linear 8:1 0,
32768 1024000 linear 8:2 0;
dm-verity,,3,ro,
0 1638400 verity 1 /dev/sdc1 /dev/sdc2 4096 4096 204800 1 sha256
ac87db56303c9c1da433d7209b5a6ef3e4779df141200cbd7c157dcb8dd89c42
5ebfe87f7df3235b80a117ebc4078e44f55045487ad4a96581d1adb564615b51
Other examples (per target):
"crypt"::
dm-crypt,,8,ro,
0 1048576 crypt aes-xts-plain64
babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0
/dev/sda 0 1 allow_discards
"delay"::
dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500
"linear"::
dm-linear,,,rw,
0 32768 linear /dev/sda1 0,
32768 1024000 linear /dev/sda2 0,
1056768 204800 linear /dev/sda3 0,
1261568 512000 linear /dev/sda4 0
"snapshot-origin"::
dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2
"striped"::
dm-striped,,4,ro,0 1638400 striped 4 4096
/dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0
"verity"::
dm-verity,,4,ro,
0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584

View File

@@ -0,0 +1,259 @@
============
dm-integrity
============
The dm-integrity target emulates a block device that has additional
per-sector tags that can be used for storing integrity information.
A general problem with storing integrity tags with every sector is that
writing the sector and the integrity tag must be atomic - i.e. in case of
crash, either both sector and integrity tag or none of them is written.
To guarantee write atomicity, the dm-integrity target uses a journal: it
writes sector data and integrity tags into the journal, commits the journal
and then copies the data and integrity tags to their respective locations.
The dm-integrity target can be used with the dm-crypt target - in this
situation the dm-crypt target creates the integrity data and passes them
to the dm-integrity target via bio_integrity_payload attached to the bio.
In this mode, the dm-crypt and dm-integrity targets provide authenticated
disk encryption - if the attacker modifies the encrypted device, an I/O
error is returned instead of random data.
The dm-integrity target can also be used as a standalone target, in this
mode it calculates and verifies the integrity tag internally. In this
mode, the dm-integrity target can be used to detect silent data
corruption on the disk or in the I/O path.
There's an alternate mode of operation where dm-integrity uses a bitmap
instead of a journal. If a bit in the bitmap is 1, the corresponding
region's data and integrity tags are not synchronized - if the machine
crashes, the unsynchronized regions will be recalculated. The bitmap mode
is faster than the journal mode, because we don't have to write the data
twice, but it is also less reliable, because if data corruption happens
when the machine crashes, it may not be detected.
When loading the target for the first time, the kernel driver will format
the device. But it will only format the device if the superblock contains
zeroes. If the superblock is neither valid nor zeroed, the dm-integrity
target can't be loaded.
To use the target for the first time:
1. overwrite the superblock with zeroes
2. load the dm-integrity target with one-sector size, the kernel driver
will format the device
3. unload the dm-integrity target
4. read the "provided_data_sectors" value from the superblock
5. load the dm-integrity target with the target size
"provided_data_sectors"
6. if you want to use dm-integrity with dm-crypt, load the dm-crypt target
with the size "provided_data_sectors"
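A scripted sketch of these steps follows; the device path, the one-sector
probe table, and the status field used to read "provided_data_sectors" are
assumptions, not a canonical recipe::
#!/bin/sh
DEV=/dev/sdb
# step 1: zero the superblock so the kernel driver formats the device
dd if=/dev/zero of=$DEV bs=4096 count=1
# step 2: load a one-sector target; the driver formats the device
dmsetup create tmp-int --table "0 1 integrity $DEV 0 - J 0"
# step 4 (read via dmsetup status rather than the raw superblock):
# provided data sectors are assumed to be the fifth status field
SECTORS=$(dmsetup status tmp-int | cut -d' ' -f5)
# step 3: unload the temporary target
dmsetup remove tmp-int
# step 5: load the target with the full size "provided_data_sectors"
dmsetup create my-int --table "0 $SECTORS integrity $DEV 0 - J 0"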
Target arguments:
1. the underlying block device
2. the number of reserved sectors at the beginning of the device - the
dm-integrity target won't read or write these sectors
3. the size of the integrity tag (if "-" is used, the size is taken from
the internal-hash algorithm)
4. mode:
D - direct writes (without journal)
in this mode, journaling is
not used and data sectors and integrity tags are written
separately. In case of crash, it is possible that the data
and integrity tag don't match.
J - journaled writes
data and integrity tags are written to the
journal and atomicity is guaranteed. In case of crash,
either both data and tag or none of them are written. The
journaled mode halves write throughput because the data
has to be written twice.
B - bitmap mode - data and metadata are written without any
synchronization, the driver maintains a bitmap of dirty
regions where data and metadata don't match. This mode can
only be used with internal hash.
R - recovery mode - in this mode, journal is not replayed,
checksums are not checked and writes to the device are not
allowed. This mode is useful for data recovery if the
device cannot be activated in any of the other standard
modes.
5. the number of additional arguments
Additional arguments:
journal_sectors:number
The size of the journal. This argument is used only when formatting the
device. If the device is already formatted, the value from the
superblock is used.
interleave_sectors:number
The number of interleaved sectors. This value is rounded down to
a power of two. If the device is already formatted, the value from
the superblock is used.
meta_device:device
Don't interleave the data and metadata on one device. Use a
separate device for metadata.
buffer_sectors:number
The number of sectors in one buffer. The value is rounded down to
a power of two.
The tag area is accessed using buffers; the buffer size is
configurable. A larger buffer size means that the I/O size will
be larger, but fewer I/Os may be issued.
journal_watermark:number
The journal watermark in percent. When the size of the journal
exceeds this watermark, the thread that flushes the journal will
be started.
commit_time:number
Commit time in milliseconds. When this time passes, the journal is
written. The journal is also written immediately if the FLUSH
request is received.
internal_hash:algorithm(:key) (the key is optional)
Use internal hash or crc.
When this argument is used, the dm-integrity target won't accept
integrity tags from the upper target, but it will automatically
generate and verify the integrity tags.
If you use a crc algorithm (such as crc32), the integrity target
will protect the data against accidental corruption.
You can also use a hmac algorithm (for example
"hmac(sha256):0123456789abcdef"), in this mode it will provide
cryptographic authentication of the data without encryption.
When this argument is not used, the integrity tags are accepted
from an upper layer target, such as dm-crypt. The upper layer
target should check the validity of the integrity tags.
recalculate
Recalculate the integrity tags automatically. It is only valid
when using internal hash.
journal_crypt:algorithm(:key) (the key is optional)
Encrypt the journal using given algorithm to make sure that the
attacker can't read the journal. You can use a block cipher here
(such as "cbc(aes)") or a stream cipher (for example "chacha20",
"salsa20", "ctr(aes)" or "ecb(arc4)").
The journal contains a history of the last writes to the block device;
an attacker reading the journal could see the last sector numbers
that were written. From the sector numbers, the attacker can infer
the size of files that were written. To protect against this
situation, you can encrypt the journal.
journal_mac:algorithm(:key) (the key is optional)
Protect sector numbers in the journal from accidental or malicious
modification. To protect against accidental modification, use a
crc algorithm; to protect against malicious modification, use an
hmac algorithm with a key.
This option is not needed when using internal-hash because in this
mode, the integrity of journal entries is checked when replaying
the journal. Thus, a modified sector number would be detected at
this stage.
block_size:number
The size of a data block in bytes. The larger the block size the
less overhead there is for per-block integrity metadata.
Supported values are 512, 1024, 2048 and 4096 bytes. If not
specified the default block size is 512 bytes.
sectors_per_bit:number
In the bitmap mode, this parameter specifies the number of
512-byte sectors that corresponds to one bitmap bit.
bitmap_flush_interval:number
The bitmap flush interval in milliseconds. The metadata buffers
are synchronized when this interval expires.
The journal mode (D/J), buffer_sectors, journal_watermark, commit_time can
be changed when reloading the target (load an inactive table and swap the
tables with suspend and resume). The other arguments should not be changed
when reloading the target because the layout of disk data depends on them
and the reloaded target would be non-functional.
The layout of the formatted block device:
* reserved sectors
(they are not used by this target, they can be used for
storing LUKS metadata or for other purposes), the size of the reserved
area is specified in the target arguments
* superblock (4kiB)
* magic string - identifies that the device was formatted
* version
* log2(interleave sectors)
* integrity tag size
* the number of journal sections
* provided data sectors - the number of sectors that this target
provides (i.e. the size of the device minus the size of all
metadata and padding). The user of this target should not send
bios that access data beyond the "provided data sectors" limit.
* flags
SB_FLAG_HAVE_JOURNAL_MAC
- a flag is set if journal_mac is used
SB_FLAG_RECALCULATING
- recalculating is in progress
SB_FLAG_DIRTY_BITMAP
- journal area contains the bitmap of dirty
blocks
* log2(sectors per block)
* a position where recalculating finished
* journal
The journal is divided into sections, each section contains:
* metadata area (4kiB), it contains journal entries
- every journal entry contains:
* logical sector (specifies where the data and tag should
be written)
* last 8 bytes of data
* integrity tag (the size is specified in the superblock)
- every metadata sector ends with
* mac (8-bytes), all the macs in 8 metadata sectors form a
64-byte value. It is used to store hmac of sector
numbers in the journal section, to protect against a
possibility that the attacker tampers with sector
numbers in the journal.
* commit id
* data area (the size is variable; it depends on how many journal
entries fit into the metadata area)
- every sector in the data area contains:
* data (504 bytes of data, the last 8 bytes are stored in
the journal entry)
* commit id
To test if the whole journal section was written correctly, every
512-byte sector of the journal ends with 8-byte commit id. If the
commit id matches on all sectors in a journal section, then it is
assumed that the section was written correctly. If the commit id
doesn't match, the section was written partially and it should not
be replayed.
* one or more runs of interleaved tags and data.
Each run contains:
* tag area - it contains integrity tags. There is one tag for each
sector in the data area
* data area - it contains data sectors. The number of data sectors
in one run must be a power of two. log2 of this value is stored
in the superblock.

View File

@@ -0,0 +1,75 @@
=====
dm-io
=====
Dm-io provides synchronous and asynchronous I/O services. There are three
types of I/O services available, and each type has a sync and an async
version.
The user must set up an io_region structure to describe the desired location
of the I/O. Each io_region indicates a block-device along with the starting
sector and size of the region::
struct io_region {
struct block_device *bdev;
sector_t sector;
sector_t count;
};
Dm-io can read from one io_region or write to one or more io_regions. Writes
to multiple regions are specified by an array of io_region structures.
The first I/O service type takes a list of memory pages as the data buffer for
the I/O, along with an offset into the first page::
struct page_list {
struct page_list *next;
struct page *page;
};
int dm_io_sync(unsigned int num_regions, struct io_region *where, int rw,
struct page_list *pl, unsigned int offset,
unsigned long *error_bits);
int dm_io_async(unsigned int num_regions, struct io_region *where, int rw,
struct page_list *pl, unsigned int offset,
io_notify_fn fn, void *context);
The second I/O service type takes an array of bio vectors as the data buffer
for the I/O. This service can be handy if the caller has a pre-assembled bio,
but wants to direct different portions of the bio to different devices::
int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where,
int rw, struct bio_vec *bvec,
unsigned long *error_bits);
int dm_io_async_bvec(unsigned int num_regions, struct io_region *where,
int rw, struct bio_vec *bvec,
io_notify_fn fn, void *context);
The third I/O service type takes a pointer to a vmalloc'd memory buffer as the
data buffer for the I/O. This service can be handy if the caller needs to do
I/O to a large region but doesn't want to allocate a large number of individual
memory pages::
int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw,
void *data, unsigned long *error_bits);
int dm_io_async_vm(unsigned int num_regions, struct io_region *where, int rw,
void *data, io_notify_fn fn, void *context);
Callers of the asynchronous I/O services must include the name of a completion
callback routine and a pointer to some context data for the I/O::
typedef void (*io_notify_fn)(unsigned long error, void *context);
The "error" parameter in this callback, as well as the `*error` parameter in
all of the synchronous versions, is a bitset (instead of a simple error value).
In the case of a write I/O to multiple regions, this bitset allows dm-io to
indicate success or failure on each individual region.
Before using any of the dm-io services, the user should call dm_io_get()
and specify the number of pages they expect to perform I/O on concurrently.
Dm-io will attempt to resize its mempool to make sure enough pages are
always available in order to avoid unnecessary waiting while performing I/O.
When the user is finished using the dm-io services, they should call
dm_io_put() and specify the same number of pages that were given on the
dm_io_get() call.

View File

@@ -0,0 +1,57 @@
=====================
Device-Mapper Logging
=====================
The device-mapper logging code is used by some of the device-mapper
RAID targets to track regions of the disk that are not consistent.
A region (or portion of the address space) of the disk may be
inconsistent because a RAID stripe is currently being operated on or
a machine died while the region was being altered. In the case of
mirrors, a region would be considered dirty/inconsistent while you
are writing to it because the writes need to be replicated for all
the legs of the mirror and may not reach the legs at the same time.
Once all writes are complete, the region is considered clean again.
There is a generic logging interface that the device-mapper RAID
implementations use to perform logging operations (see
dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different
logging implementations are available and provide different
capabilities. The list includes:
============== ==============================================================
Type Files
============== ==============================================================
disk drivers/md/dm-log.c
core drivers/md/dm-log.c
userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h
============== ==============================================================
The "disk" log type
-------------------
This log implementation commits the log state to disk. This way, the
logging state survives reboots/crashes.
The "core" log type
-------------------
This log implementation keeps the log state in memory. The log state
will not survive a reboot or crash, but there may be a small boost in
performance. This method can also be used if no storage device is
available for storing log state.
The "userspace" log type
------------------------
This log type simply provides a way to export the log API to userspace,
so log implementations can be done there. This is done by forwarding most
logging requests to userspace, where a daemon receives and processes the
request.
The structures used for communication between kernel and userspace are
located in include/linux/dm-log-userspace.h. Due to the frequency,
diversity, and 2-way communication nature of the exchanges between
kernel and userspace, 'connector' is used as the interface for
communication.
There are currently two userspace log implementations that leverage this
framework - "clustered-disk" and "clustered-core". These implementations
provide a cluster-coherent log for shared-storage. Device-mapper mirroring
can be used in a shared-storage environment when the cluster log implementations
are employed.

View File

@@ -0,0 +1,48 @@
===============
dm-queue-length
===============
dm-queue-length is a path selector module for device-mapper targets,
which selects a path with the least number of in-flight I/Os.
The path selector name is 'queue-length'.
Table parameters for each path: [<repeat_count>]
::
<repeat_count>: The number of I/Os to dispatch using the selected
path before switching to the next path.
If not given, internal default is used. To check
the default value, see the activated table.
Status for each path: <status> <fail-count> <in-flight>
::
<status>: 'A' if the path is active, 'F' if the path is failed.
<fail-count>: The number of path failures.
<in-flight>: The number of in-flight I/Os on the path.
Algorithm
=========
dm-queue-length increments/decrements 'in-flight' when an I/O is
dispatched/completed respectively.
dm-queue-length selects a path with the minimum 'in-flight'.
Examples
========
In this example, 2 paths (sda and sdb) are used with repeat_count == 128.
::
# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
| dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0

View File

@@ -0,0 +1,419 @@
=======
dm-raid
=======
The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
It allows the MD RAID drivers to be accessed using a device-mapper
interface.
Mapping Table Interface
-----------------------
The target is named "raid" and it accepts the following parameters::
<raid_type> <#raid_params> <raid_params> \
<#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]
<raid_type>:
============= ===============================================================
raid0 RAID0 striping (no resilience)
raid1 RAID1 mirroring
raid4 RAID4 with dedicated last parity disk
raid5_n RAID5 with dedicated last parity disk supporting takeover
Same as raid4
- Transitory layout
raid5_la RAID5 left asymmetric
- rotating parity 0 with data continuation
raid5_ra RAID5 right asymmetric
- rotating parity N with data continuation
raid5_ls RAID5 left symmetric
- rotating parity 0 with data restart
raid5_rs RAID5 right symmetric
- rotating parity N with data restart
raid6_zr RAID6 zero restart
- rotating parity zero (left-to-right) with data restart
raid6_nr RAID6 N restart
- rotating parity N (right-to-left) with data restart
raid6_nc RAID6 N continue
- rotating parity N (right-to-left) with data continuation
raid6_n_6 RAID6 with dedicated parity disks
- parity and Q-syndrome on the last 2 disks;
layout for takeover from/to raid4/raid5_n
raid6_la_6 Same as "raid5_la" plus dedicated last Q-syndrome disk
- layout for takeover from raid5_la from/to raid6
raid6_ra_6 Same as "raid5_ra" plus dedicated last Q-syndrome disk
- layout for takeover from raid5_ra from/to raid6
raid6_ls_6 Same as "raid5_ls" plus dedicated last Q-syndrome disk
- layout for takeover from raid5_ls from/to raid6
raid6_rs_6 Same as "raid5_rs" plus dedicated last Q-syndrome disk
- layout for takeover from raid5_rs from/to raid6
raid10 Various RAID10 inspired algorithms chosen by additional params
(see raid10_format and raid10_copies below)
- RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
- RAID1E: Integrated Adjacent Stripe Mirroring
- RAID1E: Integrated Offset Stripe Mirroring
- and other similar RAID10 variants
============= ===============================================================
Reference: Chapter 4 of
http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf
<#raid_params>: The number of parameters that follow.
<raid_params> consists of
Mandatory parameters:
<chunk_size>:
Chunk size in sectors. This parameter is often known as
"stripe size". It is the only mandatory parameter and
is placed first.
followed by optional parameters (in any order):
[sync|nosync]
Force or prevent RAID initialization.
[rebuild <idx>]
Rebuild drive number 'idx' (first drive is 0).
[daemon_sleep <ms>]
Interval between runs of the bitmap daemon that
clear bits. A longer interval means less bitmap I/O but
resyncing after a failure is likely to take longer.
[min_recovery_rate <kB/sec/disk>]
Throttle RAID initialization
[max_recovery_rate <kB/sec/disk>]
Throttle RAID initialization
[write_mostly <idx>]
Mark drive index 'idx' write-mostly.
[max_write_behind <sectors>]
See '--write-behind=' (man mdadm)
[stripe_cache <sectors>]
Stripe cache size (RAID 4/5/6 only)
[region_size <sectors>]
The region_size multiplied by the number of regions is the
logical size of the array. The bitmap records the device
synchronisation state for each region.
[raid10_copies <# copies>], [raid10_format <near|far|offset>]
These two options are used to alter the default layout of
a RAID10 configuration. The number of copies can be
specified, but the default is 2. There are also three
variations to how the copies are laid down - the default
is "near". Near copies are what most people think of with
respect to mirroring. If these options are left unspecified,
or 'raid10_copies 2' and/or 'raid10_format near' are given,
then the layouts for 2, 3 and 4 devices are:
======== ========== ==============
2 drives 3 drives 4 drives
======== ========== ==============
A1 A1 A1 A1 A2 A1 A1 A2 A2
A2 A2 A2 A3 A3 A3 A3 A4 A4
A3 A3 A4 A4 A5 A5 A5 A6 A6
A4 A4 A5 A6 A6 A7 A7 A8 A8
.. .. .. .. .. .. .. .. ..
======== ========== ==============
The 2-device layout is equivalent to 2-way RAID1. The 4-device
layout is what a traditional RAID10 would look like. The
3-device layout is what might be called a 'RAID1E - Integrated
Adjacent Stripe Mirroring'.
If 'raid10_copies 2' and 'raid10_format far', then the layouts
for 2, 3 and 4 devices are:
======== ============ ===================
2 drives 3 drives 4 drives
======== ============ ===================
A1 A2 A1 A2 A3 A1 A2 A3 A4
A3 A4 A4 A5 A6 A5 A6 A7 A8
A5 A6 A7 A8 A9 A9 A10 A11 A12
.. .. .. .. .. .. .. .. ..
A2 A1 A3 A1 A2 A2 A1 A4 A3
A4 A3 A6 A4 A5 A6 A5 A8 A7
A6 A5 A9 A7 A8 A10 A9 A12 A11
.. .. .. .. .. .. .. .. ..
======== ============ ===================
If 'raid10_copies 2' and 'raid10_format offset', then the
layouts for 2, 3 and 4 devices are:
======== ========== ================
2 drives 3 drives 4 drives
======== ========== ================
A1 A2 A1 A2 A3 A1 A2 A3 A4
A2 A1 A3 A1 A2 A2 A1 A4 A3
A3 A4 A4 A5 A6 A5 A6 A7 A8
A4 A3 A6 A4 A5 A6 A5 A8 A7
A5 A6 A7 A8 A9 A9 A10 A11 A12
A6 A5 A9 A7 A8 A10 A9 A12 A11
.. .. .. .. .. .. .. .. ..
======== ========== ================
Here we see layouts closely akin to 'RAID1E - Integrated
Offset Stripe Mirroring'.
[delta_disks <N>]
The delta_disks option value (-251 < N < +251) triggers
device removal (negative value) or device addition (positive
value) on any reshape-supporting raid level (4/5/6 and 10).
RAID levels 4/5/6 allow both addition and removal of devices
(as metadata and data device tuples); raid10_near and
raid10_offset only allow device addition. raid10_far does not
support any reshaping at all.
A minimum number of devices has to be kept to preserve
resilience: 3 devices for raid4/5 and 4 devices for raid6.
[data_offset <sectors>]
This option value defines the offset into each data device
where the data starts. This is used to provide out-of-place
reshaping space to avoid writing over data while
changing the layout of stripes, so that an interruption or
crash at any time does not risk losing data.
E.g. when adding devices to an existing raid set during
forward reshaping, the out-of-place space will be allocated
at the beginning of each raid device. The kernel raid4/5/6/10
MD personalities supporting such device addition will read the data from
the existing first stripes (those with a smaller number of stripes)
starting at data_offset to fill up a new stripe with the larger
number of stripes, calculate the redundancy blocks (CRC/Q-syndrome)
and write that new stripe to offset 0. The same is applied to all
N-1 other new stripes. This out-of-place scheme is used to change
the RAID type (i.e. the allocation algorithm) as well, e.g.
changing from raid5_ls to raid5_n.
[journal_dev <dev>]
This option adds a journal device to raid4/5/6 raid sets and
uses it to close the 'write hole' caused by the non-atomic updates
to the component devices which can cause data loss during recovery.
The journal device is used as writethrough, thus causing writes to
be throttled compared to non-journaled raid4/5/6 sets.
Takeover/reshape is not possible with a raid4/5/6 journal device;
it has to be deconfigured before requesting these.
[journal_mode <mode>]
This option sets the caching mode on journaled raid4/5/6 raid sets
(see 'journal_dev <dev>' above) to 'writethrough' or 'writeback'.
If 'writeback' is selected the journal device has to be resilient
and must not suffer from the 'write hole' problem itself (e.g. use
raid1 or raid10) to avoid a single point of failure.
<#raid_devs>: The number of devices composing the array.
Each device consists of two entries. The first is the device
containing the metadata (if any); the second is the one containing the
data. A maximum of 64 metadata/data device entries is supported
up to target version 1.8.0.
Target version 1.9.0 supports up to 253, a limit enforced by the MD kernel runtime.
If a drive has failed or is missing at creation time, a '-' can be
given for both the metadata and data drives for a given position.
Example Tables
--------------
::
# RAID4 - 4 data drives, 1 parity (no metadata devices)
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)
0 1960893648 raid \
raid4 1 2048 \
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
# RAID4 - 4 data drives, 1 parity (with metadata devices)
# Chunk size of 1MiB, force RAID initialization,
# min recovery rate at 20 kiB/sec/disk
0 1960893648 raid \
raid4 4 2048 sync min_recovery_rate 20 \
5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
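As a further, purely illustrative sketch, a hypothetical RAID10 table using
the raid10_copies and raid10_format parameters described above (the device
numbers and array size here are made up for the example)::
# RAID10 - 4 drives, 2 "near" copies, no metadata devices
# Chunk size of 1MiB
0 1960893648 raid \
raid10 5 2048 raid10_copies 2 raid10_format near \
4 - 8:17 - 8:33 - 8:49 - 8:65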
Status Output
-------------
'dmsetup table' displays the table used to construct the mapping.
The optional parameters are always printed in the order listed
above with "sync" or "nosync" always output ahead of the other
arguments, regardless of the order used when originally loading the table.
Arguments that can be repeated are ordered by value.
'dmsetup status' yields information on the state and health of the array.
The output is as follows (normally a single line, but expanded here for
clarity)::
1: <s> <l> raid \
2: <raid_type> <#devices> <health_chars> \
3: <sync_ratio> <sync_action> <mismatch_cnt>
Line 1 is the standard output produced by device-mapper.
Line 2 & 3 are produced by the raid target and are best explained by example::
0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with its initial
recovery. Here is a fuller description of the individual fields:
=============== =========================================================
<raid_type> Same as the <raid_type> used to create the array.
<health_chars> One char for each device, indicating:
- 'A' = alive and in-sync
- 'a' = alive but not in-sync
- 'D' = dead/failed.
<sync_ratio> The ratio indicating how much of the array has undergone
the process described by 'sync_action'. If the
'sync_action' is "check" or "repair", then the process
of "resync" or "recover" can be considered complete.
<sync_action> One of the following possible states:
idle
- No synchronization action is being performed.
frozen
- The current action has been halted.
resync
- Array is undergoing its initial synchronization
or is resynchronizing after an unclean shutdown
(possibly aided by a bitmap).
recover
- A device in the array is being rebuilt or
replaced.
check
- A user-initiated full check of the array is
being performed. All blocks are read and
checked for consistency. The number of
discrepancies found are recorded in
<mismatch_cnt>. No changes are made to the
array by this action.
repair
- The same as "check", but discrepancies are
corrected.
reshape
- The array is undergoing a reshape.
<mismatch_cnt> The number of discrepancies found between mirror copies
in RAID1/10 or wrong parity values found in RAID4/5/6.
This value is valid only after a "check" of the array
is performed. A healthy array has a 'mismatch_cnt' of 0.
<data_offset> The current data offset to the start of the user data on
each component device of a raid set (see the respective
raid parameter to support out-of-place reshaping).
<journal_char> - 'A' - active write-through journal device.
- 'a' - active write-back journal device.
- 'D' - dead journal device.
- '-' - no journal device.
=============== =========================================================
Message Interface
-----------------
The dm-raid target will accept certain actions through the 'message' interface.
('man dmsetup' for more information on the message interface.) These actions
include:
========= ================================================
"idle" Halt the current sync action.
"frozen" Freeze the current sync action.
"resync" Initiate/continue a resync.
"recover" Initiate/continue a recover process.
"check" Initiate a check (i.e. a "scrub") of the array.
"repair" Initiate a repair of the array.
========= ================================================
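For example, to start and then halt a scrub of a hypothetical raid set named
'my_raid' (the sync_action field of 'dmsetup status' reflects the result)::
dmsetup message my_raid 0 check
dmsetup status my_raid
dmsetup message my_raid 0 idle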
Discard Support
---------------
The implementation of discard support among hardware vendors varies.
When a block is discarded, some storage devices will return zeroes when
the block is read. These devices set the 'discard_zeroes_data'
attribute. Other devices will return random data. Confusingly, some
devices that advertise 'discard_zeroes_data' will not reliably return
zeroes when discarded blocks are read! Since RAID 4/5/6 uses blocks
from a number of devices to calculate parity blocks and (for performance
reasons) relies on 'discard_zeroes_data' being reliable, it is important
that the devices be consistent. Blocks may be discarded in the middle
of a RAID 4/5/6 stripe and if subsequent read results are not
consistent, the parity blocks may be calculated differently at any time,
making the parity blocks useless for redundancy. It is important to
understand how your hardware behaves with discards if you are going to
enable discards with RAID 4/5/6.
Since the behavior of storage devices is unreliable in this respect,
even when reporting 'discard_zeroes_data', by default RAID 4/5/6
discard support is disabled -- this ensures data integrity at the
expense of losing some performance.
Storage devices that properly support 'discard_zeroes_data' are
increasingly whitelisted in the kernel and can thus be trusted.
For trusted devices, the following dm-raid module parameter can be set
to safely enable discard support for RAID 4/5/6:
'devices_handle_discard_safely'
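A minimal sketch of enabling it, either at module load time or - assuming
the parameter is exported writable on your kernel - at runtime via sysfs::
# at module load time
modprobe dm-raid devices_handle_discard_safely=Y
# or at runtime, if the parameter is writable
echo Y > /sys/module/dm_raid/parameters/devices_handle_discard_safely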
Version History
---------------
::
1.0.0 Initial version. Support for RAID 4/5/6
1.1.0 Added support for RAID 1
1.2.0 Handle creation of arrays that contain failed devices.
1.3.0 Added support for RAID 10
1.3.1 Allow device replacement/rebuild for RAID 10
1.3.2 Fix/improve redundancy checking for RAID10
1.4.0 Non-functional change. Removes arg from mapping function.
1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5).
1.4.2 Add RAID10 "far" and "offset" algorithm support.
1.5.0 Add message interface to allow manipulation of the sync_action.
New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
1.5.1 Add ability to restore transiently failed devices on resume.
1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
1.6.0 Add discard support (and devices_handle_discard_safely module param).
1.7.0 Add support for MD RAID0 mappings.
1.8.0 Explicitly check for compatible flags in the superblock metadata
and reject to start the raid set if any are set by a newer
target version, thus avoiding data corruption on a raid set
with a reshape in progress.
1.9.0 Add support for RAID level takeover/reshape/region size
and set size reduction.
1.9.1 Fix activation of existing RAID 4/10 mapped devices
1.9.2 Don't emit '- -' on the status table line in case the constructor
fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and
'D' on the status line. If '- -' is passed into the constructor, emit
'- -' on the table line and '-' as the status line health character.
1.10.0 Add support for raid4/5/6 journal device
1.10.1 Fix data corruption on reshape request
1.11.0 Fix table line argument order
(wrong raid10_copies/raid10_format sequence)
1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size and
state races.
1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
1.14.0 Fix reshape race on small devices. Fix stripe adding reshape
deadlock/potential data corruption. Update superblock when
specific devices are requested via rebuild. Fix RAID leg
rebuild errors.

View File

@@ -0,0 +1,101 @@
===============
dm-service-time
===============
dm-service-time is a path selector module for device-mapper targets,
which selects a path with the shortest estimated service time for
the incoming I/O.
The service time for each path is estimated by dividing the total size
of in-flight I/Os on a path by the performance value of the path.
The performance value is a relative throughput value among all paths
in a path-group, and it can be specified as a table argument.
The path selector name is 'service-time'.
Table parameters for each path:
[<repeat_count> [<relative_throughput>]]
<repeat_count>:
The number of I/Os to dispatch using the selected
path before switching to the next path.
If not given, the internal default is used. To check
the default value, see the activated table.
<relative_throughput>:
The relative throughput value of the path
among all paths in the path-group.
The valid range is 0-100.
If not given, minimum value '1' is used.
If '0' is given, the path isn't selected while
other paths having a positive value are available.
Status for each path:
<status> <fail-count> <in-flight-size> <relative_throughput>
<status>:
'A' if the path is active, 'F' if the path is failed.
<fail-count>:
The number of path failures.
<in-flight-size>:
The size of in-flight I/Os on the path.
<relative_throughput>:
The relative throughput value of the path
among all paths in the path-group.
Algorithm
=========
dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
dispatched and subtracts when completed.
Basically, dm-service-time selects a path having minimum service time
which is calculated by::
('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'
However, some optimizations below are used to reduce the calculation
as much as possible.
1. If the paths have the same 'relative_throughput', skip
the division and just compare the 'in-flight-size'.
2. If the paths have the same 'in-flight-size', skip the division
and just compare the 'relative_throughput'.
3. If some paths have non-zero 'relative_throughput' and others
have zero 'relative_throughput', ignore those paths with zero
'relative_throughput'.
If such optimizations can't be applied, calculate service time, and
compare service time.
If calculated service time is equal, the path having maximum
'relative_throughput' may be better. So compare 'relative_throughput'
then.
Examples
========
Consider the case where 2 paths (sda and sdb) are used with repeat_count == 128,
sda has an average throughput of 1GB/s and sdb has 4GB/s. The
'relative_throughput' value may be '1' for sda and '4' for sdb::
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
Alternatively, '2' for sda and '8' for sdb would be equally valid, since only the ratio between the values matters::
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
dmsetup create test
#
# dmsetup table
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
#
# dmsetup status
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8

View File

@@ -0,0 +1,110 @@
====================
device-mapper uevent
====================
The device-mapper uevent code adds the capability to device-mapper to create
and send kobject uevents (uevents). Previously device-mapper events were only
available through the ioctl interface. The advantage of the uevents interface
is that the event contains environment attributes providing increased context for
the event, avoiding the need to query the state of the device-mapper device after
the event is received.
There are currently two functions for device-mapper uevents. The first function
listed creates the event and the second sends the queued event(s)::
void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
const char *path, unsigned nr_valid_paths)
void dm_send_uevents(struct list_head *events, struct kobject *kobj)
The variables added to the uevent environment are:
Variable Name: DM_TARGET
------------------------
:Uevent Action(s): KOBJ_CHANGE
:Type: string
:Description:
:Value: Name of device-mapper target that generated the event.
Variable Name: DM_ACTION
------------------------
:Uevent Action(s): KOBJ_CHANGE
:Type: string
:Description:
:Value: Device-mapper specific action that caused the uevent action.
PATH_FAILED - A path has failed;
PATH_REINSTATED - A path has been reinstated.
Variable Name: DM_SEQNUM
------------------------
:Uevent Action(s): KOBJ_CHANGE
:Type: unsigned integer
:Description: A sequence number for this specific device-mapper device.
:Value: Valid unsigned integer range.
Variable Name: DM_PATH
----------------------
:Uevent Action(s): KOBJ_CHANGE
:Type: string
:Description: Major and minor number of the path device pertaining to this
event.
:Value: Path name in the form of "Major:Minor"
Variable Name: DM_NR_VALID_PATHS
--------------------------------
:Uevent Action(s): KOBJ_CHANGE
:Type: unsigned integer
:Description:
:Value: Valid unsigned integer range.
Variable Name: DM_NAME
----------------------
:Uevent Action(s): KOBJ_CHANGE
:Type: string
:Description: Name of the device-mapper device.
:Value: Name
Variable Name: DM_UUID
----------------------
:Uevent Action(s): KOBJ_CHANGE
:Type: string
:Description: UUID of the device-mapper device.
:Value: UUID. (Empty string if there isn't one.)
An example of the uevents generated, as captured by udevmonitor, is shown
below.
1.) Path failure::
UEVENT[1192521009.711215] change@/block/dm-3
ACTION=change
DEVPATH=/block/dm-3
SUBSYSTEM=block
DM_TARGET=multipath
DM_ACTION=PATH_FAILED
DM_SEQNUM=1
DM_PATH=8:32
DM_NR_VALID_PATHS=0
DM_NAME=mpath2
DM_UUID=mpath-35333333000002328
MINOR=3
MAJOR=253
SEQNUM=1130
2.) Path reinstate::
UEVENT[1192521132.989927] change@/block/dm-3
ACTION=change
DEVPATH=/block/dm-3
SUBSYSTEM=block
DM_TARGET=multipath
DM_ACTION=PATH_REINSTATED
DM_SEQNUM=2
DM_PATH=8:32
DM_NR_VALID_PATHS=1
DM_NAME=mpath2
DM_UUID=mpath-35333333000002328
MINOR=3
MAJOR=253
SEQNUM=1131
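As a sketch of how these variables can be consumed, a hypothetical udev rule
could run a (made-up) helper script whenever a multipath path fails::
# /etc/udev/rules.d/99-dm-path-monitor.rules (hypothetical)
ACTION=="change", SUBSYSTEM=="block", ENV{DM_TARGET}=="multipath", \
ENV{DM_ACTION}=="PATH_FAILED", RUN+="/usr/local/bin/alert-path-failure.sh $env{DM_NAME}"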

View File

@@ -0,0 +1,146 @@
========
dm-zoned
========
The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.
For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):
http://www.t10.org/drafts.htm#ZBC_Family
and (for ATA devices):
http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf
The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as little as 5 zones will be used
internally for storing metadata and performing reclaim operations.
dm-zoned target devices are formatted and checked using the dmzadm
utility available at:
https://github.com/hgst/dm-zoned-tools
Algorithm
=========
dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata.
The zones of the device are separated into 2 types:
1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.
2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may be used also for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.
dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).
The on-disk metadata format is as follows:
1) The first block of the first conventional zone found contains the
super block which describes the on disk amount and position of metadata
blocks.
2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the device zone size. The
mapping table is indexed by chunk number and each mapping entry
indicates the zone number of the device storing the chunk of data. Each
mapping entry may also indicate the number of a conventional zone used
to buffer random modifications to the data zone.
3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is always valid only in the data zone mapping the
chunk or in the buffer zone of the chunk.
For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset of the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.
Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.
After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone freed for reuse.
Metadata Protection
===================
To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set; a generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place updates of
metadata blocks can be done in the primary metadata set. This ensures that
one of the sets is always consistent (all modifications committed or none
at all). Flush operations are used as a commit point. Upon reception of
a flush request, metadata modification activity is temporarily blocked
(for both incoming BIO processing and reclaim process) and all dirty
metadata blocks are staged and updated. Normal operation is then
resumed. Flushing metadata thus only temporarily delays write and
discard requests. Read requests can be processed concurrently while
metadata flush is being executed.
Usage
=====
A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.
Ex::
dmzadm --format /dev/sdxx
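The same utility can also verify the metadata of an already formatted
device; a sketch, assuming the --check operation provided by
dm-zoned-tools::
dmzadm --check /dev/sdxx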
For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex::
echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \
dmsetup create dmz-`basename ${dev}`

View File

@@ -0,0 +1,116 @@
======
dm-era
======
Introduction
============
dm-era is a target that behaves similarly to the linear target. In
addition it keeps track of which blocks were written within a user
defined period of time called an 'era'. Each era target instance
maintains the current era as a monotonically increasing 32-bit
counter.
Use cases include tracking changed blocks for backup software, and
partially invalidating the contents of a cache to restore cache
coherency after rolling back a vendor snapshot.
Constructor
===========
era <metadata dev> <origin dev> <block size>
================ ======================================================
metadata dev fast device holding the persistent metadata
origin dev device holding data blocks that may change
block size block size of origin data device, granularity that is
tracked by the target
================ ======================================================
Messages
========
None of the dm messages take any arguments.
checkpoint
----------
Possibly move to a new era. You shouldn't assume the era has
incremented. After sending this message, you should check the
current era via the status line.
take_metadata_snap
------------------
Create a clone of the metadata, to allow a userland process to read it.
drop_metadata_snap
------------------
Drop the metadata snapshot.
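A minimal sketch of the message flow, assuming an era device named
'era_dev'::
dmsetup message era_dev 0 checkpoint
dmsetup status era_dev # read back the current era
dmsetup message era_dev 0 take_metadata_snap
# ... a userland process reads the held metadata root ...
dmsetup message era_dev 0 drop_metadata_snap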
Status
======
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<current era> <held metadata root | '-'>
========================= ==============================================
metadata block size Fixed block size for each metadata block in
sectors
#used metadata blocks Number of metadata blocks used
#total metadata blocks Total number of metadata blocks
current era The current era
held metadata root The location, in blocks, of the metadata root
that has been 'held' for userspace read
access. '-' indicates there is no held root
========================= ==============================================
Detailed use case
=================
The scenario of invalidating a cache when rolling back a vendor
snapshot was the primary use case when developing this target:
Taking a vendor snapshot
------------------------
- Send a checkpoint message to the era target
- Make a note of the current era in its status line
- Take vendor snapshot (the era and snapshot should be forever
associated now).
Rolling back to a vendor snapshot
----------------------------------
- Cache enters passthrough mode (see: dm-cache's docs in cache.rst)
- Rollback vendor storage
- Take metadata snapshot
- Ascertain which blocks have been written since the snapshot was taken
by checking each block's era
- Invalidate those blocks in the caching software
- Cache returns to writeback/writethrough mode
Memory usage
============
The target uses a bitset to record writes in the current era. It also
has a spare bitset ready for switching over to a new era. Other than
that it uses a few 4k blocks for updating metadata::
(4 * nr_blocks) bytes + buffers
Resilience
==========
Metadata is updated on disk before a write to a previously unwritten
block is performed. As such dm-era should not be affected by a hard
crash such as power failure.
Userland tools
==============
Userland tools are found in the increasingly poorly named
thin-provisioning-tools project:
https://github.com/jthornber/thin-provisioning-tools

View File

@@ -0,0 +1,42 @@
=============
Device Mapper
=============
.. toctree::
:maxdepth: 1
cache-policies
cache
delay
dm-crypt
dm-flakey
dm-init
dm-integrity
dm-io
dm-log
dm-queue-length
dm-raid
dm-service-time
dm-uevent
dm-zoned
era
kcopyd
linear
log-writes
persistent-data
snapshot
statistics
striped
switch
thin-provisioning
unstriped
verity
writecache
zero
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

View File

@@ -0,0 +1,47 @@
======
kcopyd
======
Kcopyd provides the ability to copy a range of sectors from one block-device
to one or more other block-devices, with an asynchronous completion
notification. It is used by dm-snapshot and dm-mirror.
Users of kcopyd must first create a client and indicate how many memory pages
to set aside for their copy jobs. This is done with a call to
kcopyd_client_create()::
int kcopyd_client_create(unsigned int num_pages,
struct kcopyd_client **result);
To start a copy job, the user must set up io_region structures to describe
the source and destinations of the copy. Each io_region indicates a
block-device along with the starting sector and size of the region. The source
of the copy is given as one io_region structure, and the destinations of the
copy are given as an array of io_region structures::
struct io_region {
struct block_device *bdev;
sector_t sector;
sector_t count;
};
To start the copy, the user calls kcopyd_copy(), passing in the client
pointer, pointers to the source and destination io_regions, the name of a
completion callback routine, and a pointer to some context data for the copy::
int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from,
unsigned int num_dests, struct io_region *dests,
unsigned int flags, kcopyd_notify_fn fn, void *context);
typedef void (*kcopyd_notify_fn)(int read_err, unsigned int write_err,
void *context);
When the copy completes, kcopyd will call the user's completion routine,
passing back the user's context pointer. It will also indicate if a read or
write error occurred during the copy.
When a user is done with all their copy jobs, they should call
kcopyd_client_destroy() to delete the kcopyd client, which will release the
associated memory pages::
void kcopyd_client_destroy(struct kcopyd_client *kc);

View File

@@ -0,0 +1,63 @@
=========
dm-linear
=========
Device-Mapper's "linear" target maps a linear range of the Device-Mapper
device onto a linear range of another device. This is the basic building
block of logical volume managers.
Parameters: <dev path> <offset>
<dev path>:
Full pathname to the underlying block-device, or a
"major:minor" device-number.
<offset>:
Starting sector within the device.
Example scripts
===============
::
#!/bin/sh
# Create an identity mapping for a device
echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity
::
#!/bin/sh
# Join 2 devices together
size1=`blockdev --getsz $1`
size2=`blockdev --getsz $2`
echo "0 $size1 linear $1 0
$size1 $size2 linear $2 0" | dmsetup create joined
::
#!/usr/bin/perl -w
# Split a device into 4M chunks and then join them together in reverse order.
my $name = "reverse";
my $extent_size = 4 * 1024 * 2;
my $dev = $ARGV[0];
my $table = "";
my $count = 0;
if (!defined($dev)) {
die("Please specify a device.\n");
}
my $dev_size = `blockdev --getsz $dev`;
# int() already discards any partial trailing extent
my $extents = int($dev_size / $extent_size);
while ($extents > 0) {
my $this_start = $count * $extent_size;
$extents--;
$count++;
my $this_offset = $extents * $extent_size;
$table .= "$this_start $extent_size linear $dev $this_offset\n";
}
`echo \"$table\" | dmsetup create $name`;

View File

@@ -0,0 +1,145 @@
=============
dm-log-writes
=============
This target takes 2 devices, one to pass all IO to normally, and one to log all
of the write operations to. This is intended for file system developers wishing
to verify the integrity of metadata or data as the file system is written to.
There is a log_write_entry written for every WRITE request and the target is
able to take arbitrary data from userspace to insert into the log. The data
that is in the WRITE requests is copied into the log to make the replay happen
exactly as it happened originally.
Log Ordering
============
We log things in order of completion once we are sure the write is no longer in
cache. This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request. This is to make it easier for userspace to replay
the log in a way that correlates to what is on disk and not what is in cache,
to make it easier to detect improper waiting/flushing.
This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request and once
the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
simulate the worst case scenario with regard to power failures. Consider the
following example (W means write, C means complete)::
W1,W2,W3,C3,C2,Wflush,C1,Cflush
The log would show the following::
W3,W2,flush,W1....
Again this is to simulate what is actually on disk, this allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.
Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete as those requests will obviously bypass the device cache.
Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
request. Consider the following example::
WRITE block 1, DISCARD block 1, FLUSH
If we logged DISCARD when it completed, the replay would look like this::
DISCARD 1, WRITE 1, FLUSH
which isn't quite what happened and wouldn't be caught during the log replay.
Target interface
================
i) Constructor
log-writes <dev_path> <log_dev_path>
============= ==============================================
dev_path Device that all of the IO will go to normally.
log_dev_path Device where the log entries are written to.
============= ==============================================
ii) Status
<#logged entries> <highest allocated sector>
=========================== ========================
#logged entries Number of logged entries
highest allocated sector Highest allocated sector
=========================== ========================
iii) Messages
mark <description>
You can use a dmsetup message to set an arbitrary mark in a log.
For example say you want to fsck a file system after every
write, but first you need to replay up to the mkfs to make sure
we're fsck'ing something reasonable, you would do something like
this::
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs
<run test>
This would allow you to replay the log up to the mkfs mark and
then replay from that point on doing the fsck check in the
interval that you want.
Every log has a mark at the end labeled "dm-log-writes-end".
Userspace component
===================
There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes
Example usage
=============
Say you want to test fsync on your file system. You would do something like
this::
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs
mount /dev/mapper/log /mnt/btrfs-test
<some test that does fsync at the end>
dmsetup message log 0 mark fsync
md5sum /mnt/btrfs-test/foo
umount /mnt/btrfs-test
dmsetup remove log
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
mount /dev/sdb /mnt/btrfs-test
md5sum /mnt/btrfs-test/foo
<verify md5sum's are correct>
Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation. You could do this with::
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs
mount /dev/mapper/log /mnt/btrfs-test
<fsstress to dirty the fs>
btrfs filesystem balance /mnt/btrfs-test
umount /mnt/btrfs-test
dmsetup remove log
replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
btrfsck /dev/sdb
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
--fsck "btrfsck /dev/sdb" --check fua
This will replay the log until it sees a FUA request, run the fsck command,
and if the fsck passes it will replay to the next FUA, and so on until the
replay completes or the fsck command exits abnormally.

View File

@@ -0,0 +1,88 @@
===============
Persistent data
===============
Introduction
============
The more-sophisticated device-mapper targets require complex metadata
that is managed in kernel. In late 2010 we were seeing that various
different targets were rolling their own data structures, for example:
- Mikulas Patocka's multisnap implementation
- Heinz Mauelshagen's thin provisioning target
- Another btree-based caching target posted to dm-devel
- Another multi-snapshot target based on a design of Daniel Phillips
Maintaining these data structures takes a lot of work, so if possible
we'd like to reduce the number.
The persistent-data library is an attempt to provide a re-usable
framework for people who want to store metadata in device-mapper
targets. It's currently used by the thin-provisioning target and an
upcoming hierarchical storage target.
Overview
========
The main documentation is in the header files which can all be found
under drivers/md/persistent-data.
The block manager
-----------------
dm-block-manager.[hc]
This provides access to the data on disk in fixed-sized blocks. There
is a read/write locking interface to prevent concurrent accesses, and
to keep data that is being used in the cache.
Clients of persistent-data are unlikely to use this directly.
The transaction manager
-----------------------
dm-transaction-manager.[hc]
This restricts access to blocks and enforces copy-on-write semantics.
The only way you can get hold of a writable block through the
transaction manager is by shadowing an existing block (ie. doing
copy-on-write) or allocating a fresh one. Shadowing is elided within
the same transaction so performance is reasonable. The commit method
ensures that all data is flushed before it writes the superblock.
On power failure your metadata will be as it was when last committed.
The Space Maps
--------------
dm-space-map.h
dm-space-map-metadata.[hc]
dm-space-map-disk.[hc]
On-disk data structures that keep track of reference counts of blocks.
Also acts as the allocator of new blocks. Currently two
implementations: a simpler one for managing blocks on a different
device (eg. thinly-provisioned data blocks); and one for managing
the metadata space. The latter is complicated by the need to store
its own data within the space it's managing.
The data structures
-------------------
dm-btree.[hc]
dm-btree-remove.c
dm-btree-spine.c
dm-btree-internal.h
Currently there is only one data structure, a hierarchical btree.
There are plans to add more. For example, something with an
array-like interface would see a lot of use.
The btree is 'hierarchical' in that you can define it to be composed
of nested btrees, and take multiple keys. For example, the
thin-provisioning target uses a btree with two levels of nesting.
The first maps a device id to a mapping tree, and that in turn maps a
virtual block to a physical block.
Values stored in the btrees can have arbitrary size. Keys are always
64 bits, although nesting allows you to use multiple keys.

View File

@@ -0,0 +1,196 @@
==============================
Device-mapper snapshot support
==============================
Device-mapper allows you, without massive data copying:
- To create snapshots of any block device i.e. mountable, saved states of
the block device which are also writable without interfering with the
original content;
- To create device "forks", i.e. multiple different versions of the
same data stream.
- To merge a snapshot of a block device back into the snapshot's origin
device.
In the first two cases, dm copies only the chunks of data that get
changed and uses a separate copy-on-write (COW) block device for
storage.
For snapshot merge the contents of the COW storage are merged back into
the origin device.
There are three dm targets available:
snapshot, snapshot-origin, and snapshot-merge.
- snapshot-origin <origin>
which will normally have one or more snapshots based on it.
Reads will be mapped directly to the backing device. For each write, the
original data will be saved in the <COW device> of each snapshot to keep
its visible content unchanged, at least until the <COW device> fills up.
- snapshot <origin> <COW device> <persistent?> <chunksize>
[<# feature args> [<arg>]*]
A snapshot of the <origin> block device is created. Changed chunks of
<chunksize> sectors will be stored on the <COW device>. Writes will
only go to the <COW device>. Reads will come from the <COW device> or
from <origin> for unchanged data. <COW device> will often be
smaller than the origin and if it fills up the snapshot will become
useless and be disabled, returning errors. So it is important to monitor
the amount of free space and expand the <COW device> before it fills up.
<persistent?> is P (Persistent) or N (Not persistent - will not survive
after reboot). O (Overflow) can be added as a persistent store option
to allow userspace to advertise its support for seeing "Overflow" in the
snapshot status. So supported store types are "P", "PO" and "N".
The difference between persistent and transient is that with transient
snapshots less metadata must be saved on disk - it can be kept in
memory by the kernel.
When loading or unloading the snapshot target, the corresponding
snapshot-origin or snapshot-merge target must be suspended. A failure to
suspend the origin target could result in data corruption.
Optional features:
discard_zeroes_cow - a discard issued to the snapshot device that
maps to entire chunks will zero the corresponding exception(s) in
the snapshot's exception store.
discard_passdown_origin - a discard to the snapshot device is passed
down to the snapshot-origin's underlying device. This doesn't cause
copy-out to the snapshot exception store because the snapshot-origin
target is bypassed.
The discard_passdown_origin feature depends on the discard_zeroes_cow
feature being enabled.
- snapshot-merge <origin> <COW device> <persistent> <chunksize>
[<# feature args> [<arg>]*]
takes the same table arguments as the snapshot target except it only
works with persistent snapshots. This target assumes the role of the
"snapshot-origin" target and must not be loaded if the "snapshot-origin"
is still present for <origin>.
Creates a merging snapshot that takes control of the changed chunks
stored in the <COW device> of an existing snapshot, through a handover
procedure, and merges these chunks back into the <origin>. Once merging
has started (in the background) the <origin> may be opened and the merge
will continue while I/O is flowing to it. Changes to the <origin> are
deferred until the merging snapshot's corresponding chunk(s) have been
merged. Once merging has started the snapshot device, associated with
the "snapshot" target, will return -EIO when accessed.
How snapshot is used by LVM2
============================
When you create the first LVM2 snapshot of a volume, four dm devices are used:
1) a device containing the original mapping table of the source volume;
2) a device used as the <COW device>;
3) a "snapshot" device, combining #1 and #2, which is the visible snapshot
volume;
4) the "original" volume (which uses the device number used by the original
source volume), whose table is replaced by a "snapshot-origin" mapping
from device #1.
A fixed naming scheme is used, so with the following commands::
lvcreate -L 1G -n base volumeGroup
lvcreate -L 100M --snapshot -n snap volumeGroup/base
we'll have this situation (with volumes in above order)::
# dmsetup table|grep volumeGroup
volumeGroup-base-real: 0 2097152 linear 8:19 384
volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
volumeGroup-base: 0 2097152 snapshot-origin 254:11
# ls -lL /dev/mapper/volumeGroup-*
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
How snapshot-merge is used by LVM2
==================================
A merging snapshot assumes the role of the "snapshot-origin" while
merging. As such the "snapshot-origin" is replaced with
"snapshot-merge". The "-real" device is not changed and the "-cow"
device is renamed to <origin name>-cow to aid LVM2's cleanup of the
merging snapshot after it completes. The "snapshot" that hands over its
COW device to the "snapshot-merge" is deactivated (unless using lvchange
--refresh); but if it is left active it will simply return I/O errors.
A snapshot will merge into its origin with the following command::
lvconvert --merge volumeGroup/snap
we'll now have this situation::
# dmsetup table|grep volumeGroup
volumeGroup-base-real: 0 2097152 linear 8:19 384
volumeGroup-base-cow: 0 204800 linear 8:19 2097536
volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
# ls -lL /dev/mapper/volumeGroup-*
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
How to determine when a merging is complete
===========================================
The snapshot-merge and snapshot status lines end with:
<sectors_allocated>/<total_sectors> <metadata_sectors>
Both <sectors_allocated> and <total_sectors> include both data and metadata.
During merging, the number of sectors allocated gets smaller and
smaller. Merging has finished when the number of sectors holding data
is zero, in other words <sectors_allocated> == <metadata_sectors>.
Here is a practical example (using a hybrid of lvm and dmsetup commands)::
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
base volumeGroup owi-a- 4.00g
snap volumeGroup swi-a- 1.00g base 18.97
# dmsetup status volumeGroup-snap
0 8388608 snapshot 397896/2097152 1560
^^^^ metadata sectors
# lvconvert --merge -b volumeGroup/snap
Merging of volume snap started.
# lvs volumeGroup/snap
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
base volumeGroup Owi-a- 4.00g 17.23
# dmsetup status volumeGroup-base
0 8388608 snapshot-merge 281688/2097152 1104
# dmsetup status volumeGroup-base
0 8388608 snapshot-merge 180480/2097152 712
# dmsetup status volumeGroup-base
0 8388608 snapshot-merge 16/2097152 16
Merging has finished.
::
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
base volumeGroup owi-a- 4.00g

View File

@@ -0,0 +1,225 @@
=============
DM statistics
=============
Device Mapper supports the collection of I/O statistics on user-defined
regions of a DM device. If no regions are defined no statistics are
collected so there isn't any performance impact. Only bio-based DM
devices are currently supported.
Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.
The I/O statistics counters for each step-sized area of a region are
in the same format as `/sys/block/*/stat` or `/proc/diskstats` (see:
Documentation/admin-guide/iostats.rst). But two extra counters (12 and 13) are
provided: total time spent reading and writing. When the histogram
argument is used, a 14th parameter representing the histogram of
latencies is reported. All these counters may be accessed by sending
the @stats_print message to the appropriate DM device via dmsetup.
The reported times are in milliseconds and the granularity depends on
the kernel ticks. When the option precise_timestamps is used, the
reported times are in nanoseconds.
Each region has a corresponding unique identifier, which we call a
region_id, that is assigned when the region is created. The region_id
must be supplied when querying statistics about the region, deleting the
region, etc. Unique region_ids enable multiple userspace programs to
request and process statistics for the same DM device without stepping
on each other's data.
The creation of DM statistics will allocate memory via kmalloc or
fall back to using vmalloc space. At most, 1/4 of the overall system
memory may be allocated by DM statistics. The admin can see how much
memory is used by reading:
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
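For example::
cat /sys/module/dm_mod/parameters/stats_current_allocated_bytes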
Messages
========
@stats_create <range> <step> [<number_of_optional_arguments> <optional_arguments>...] [<program_id> [<aux_data>]]
Create a new region and return the region_id.
<range>
"-"
whole device
"<start_sector>+<length>"
a range of <length> 512-byte sectors
starting with <start_sector>.
<step>
"<area_size>"
the range is subdivided into areas each containing
<area_size> sectors.
"/<number_of_areas>"
the range is subdivided into the specified
number of areas.
<number_of_optional_arguments>
The number of optional arguments
<optional_arguments>
The following optional arguments are supported:
precise_timestamps
use precise timer with nanosecond resolution
instead of the "jiffies" variable. When this argument is
used, the resulting times are in nanoseconds instead of
milliseconds. Precise timestamps are a little bit slower
to obtain than jiffies-based timestamps.
histogram:n1,n2,n3,n4,...
collect histogram of latencies. The
numbers n1, n2, etc are times that represent the boundaries
of the histogram. If precise_timestamps is not used, the
times are in milliseconds, otherwise they are in
nanoseconds. For each range, the kernel will report the
number of requests that completed within this range. For
example, if we use "histogram:10,20,30", the kernel will
report four numbers a:b:c:d. a is the number of requests
that took 0-10 ms to complete, b is the number of requests
that took 10-20 ms to complete, c is the number of requests
that took 20-30 ms to complete and d is the number of
requests that took more than 30 ms to complete.
<program_id>
An optional parameter. A name that uniquely identifies
the userspace owner of the range. This groups ranges together
so that userspace programs can identify the ranges they
created and ignore those created by others.
The kernel returns this string back in the output of
@stats_list message, but it doesn't use it for anything else.
If we omit the number of optional arguments, program id must not
be a number, otherwise it would be interpreted as the number of
optional arguments.
<aux_data>
An optional parameter. A word that provides auxiliary data
that is useful to the client program that created the range.
The kernel returns this string back in the output of
@stats_list message, but it doesn't use this value for anything.
@stats_delete <region_id>
Delete the region with the specified id.
<region_id>
region_id returned from @stats_create
@stats_clear <region_id>
Clear all the counters except the in-flight i/o counters.
<region_id>
region_id returned from @stats_create
@stats_list [<program_id>]
List all regions registered with @stats_create.
<program_id>
An optional parameter.
If this parameter is specified, only matching regions
are returned.
If it is not specified, all regions are returned.
Output format:
<region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
precise_timestamps histogram:n1,n2,n3,...
The strings "precise_timestamps" and "histogram" are printed only
if they were specified when creating the region.
@stats_print <region_id> [<starting_line> <number_of_lines>]
Print counters for each step-sized area of a region.
<region_id>
region_id returned from @stats_create
<starting_line>
The index of the starting line in the output.
If omitted, all lines are returned.
<number_of_lines>
The number of lines to include in the output.
If omitted, all lines are returned.
Output format for each step-sized area of a region:
<start_sector>+<length>
counters
The first 11 counters have the same meaning as
`/sys/block/*/stat` or `/proc/diskstats`.
Please refer to Documentation/admin-guide/iostats.rst for details.
1. the number of reads completed
2. the number of reads merged
3. the number of sectors read
4. the number of milliseconds spent reading
5. the number of writes completed
6. the number of writes merged
7. the number of sectors written
8. the number of milliseconds spent writing
9. the number of I/Os currently in progress
10. the number of milliseconds spent doing I/Os
11. the weighted number of milliseconds spent doing I/Os
Additional counters:
12. the total time spent reading in milliseconds
13. the total time spent writing in milliseconds
@stats_print_clear <region_id> [<starting_line> <number_of_lines>]
Atomically print and then clear all the counters except the
in-flight i/o counters. Useful when the client consuming the
statistics does not want to lose any statistics (those updated
between printing and clearing).
<region_id>
region_id returned from @stats_create
<starting_line>
The index of the starting line in the output.
If omitted, all lines are printed and then cleared.
<number_of_lines>
The number of lines to process.
If omitted, all lines are printed and then cleared.
@stats_set_aux <region_id> <aux_data>
Store auxiliary data aux_data for the specified region.
<region_id>
region_id returned from @stats_create
<aux_data>
The string that identifies data which is useful to the client
program that created the range. The kernel returns this
string back in the output of @stats_list message, but it
doesn't use this value for anything.
Examples
========
Subdivide the DM device 'vol' into 100 pieces and start collecting
statistics on them::
dmsetup message vol 0 @stats_create - /100
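A variant that also enables the optional arguments described above might
look like this; the count of optional arguments must precede them, and the
histogram boundary values here are purely illustrative::
dmsetup message vol 0 @stats_create - /100 2 precise_timestamps histogram:10,30,50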
Set the auxiliary data string to "foo bar baz" (the escape for each
space must also be escaped, otherwise the shell will consume them)::
dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
List the statistics::
dmsetup message vol 0 @stats_list
Print the statistics::
dmsetup message vol 0 @stats_print 0
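Atomically print and then clear the statistics (using the
@stats_print_clear message described above)::
dmsetup message vol 0 @stats_print_clear 0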
Delete the statistics::
dmsetup message vol 0 @stats_delete 0
View File
@@ -0,0 +1,61 @@
=========
dm-stripe
=========
Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0)
device across one or more underlying devices. Data is written in "chunks",
with consecutive chunks rotating among the underlying devices. This can
potentially provide improved I/O throughput by utilizing several physical
devices in parallel.
Parameters: <num devs> <chunk size> [<dev path> <offset>]+
<num devs>:
Number of underlying devices.
<chunk size>:
Size of each chunk of data. Must be at least as
large as the system's PAGE_SIZE.
<dev path>:
Full pathname to the underlying block-device, or a
"major:minor" device-number.
<offset>:
Starting sector within the device.
One or more underlying devices can be specified. The striped device size must
be a multiple of the chunk size multiplied by the number of underlying devices.
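As a quick sketch (device names hypothetical), a 2097152-sector stripe
across two disks with 128KiB (256-sector) chunks could be created directly;
note that 2097152 is a multiple of 256 * 2, as required::
# /dev/sdb and /dev/sdc are example devices
echo "0 2097152 striped 2 256 /dev/sdb 0 /dev/sdc 0" | dmsetup create my_stripe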
Example scripts
===============
::
#!/usr/bin/perl -w
# Create a striped device across any number of underlying devices. The device
# will be called "stripe_dev" and have a chunk-size of 128k.
my $chunk_size = 128 * 2;
my $dev_name = "stripe_dev";
my $num_devs = @ARGV;
my @devs = @ARGV;
my ($min_dev_size, $stripe_dev_size, $i);
if (!$num_devs) {
die("Specify at least one device\n");
}
$min_dev_size = `blockdev --getsz $devs[0]`;
for ($i = 1; $i < $num_devs; $i++) {
my $this_size = `blockdev --getsz $devs[$i]`;
$min_dev_size = ($min_dev_size < $this_size) ?
$min_dev_size : $this_size;
}
$stripe_dev_size = $min_dev_size * $num_devs;
$stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);
$table = "0 $stripe_dev_size striped $num_devs $chunk_size";
for ($i = 0; $i < $num_devs; $i++) {
$table .= " $devs[$i] 0";
}
`echo $table | dmsetup create $dev_name`;
View File
@@ -0,0 +1,141 @@
=========
dm-switch
=========
The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths. The path used for any specific region can be switched
dynamically by sending the target a message.
It maps I/O to underlying block devices efficiently when there is a large
number of fixed-sized address regions but no simple pattern (such as
dm-stripe's) that would allow for a compact representation of the mapping.
Background
----------
Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture. In this architecture, the storage group
consists of a number of distinct storage arrays ("members") each having
independent controllers, disk storage and network adapters. When a LUN
is created it is spread across multiple members. The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used. When iSCSI sessions are created, each
session is connected to an eth port on a single member. Data to a LUN
can be sent on any iSCSI session, and if the blocks being accessed are
stored on another member the I/O will be forwarded as required. This
forwarding is invisible to the initiator. The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.
This architecture simplifies the management and configuration of both
the storage group and initiators. In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth. An initiator could use a simple round
robin algorithm to send I/O across all paths and let the storage array
members forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.
A device-mapper table already lets you map different regions of a
device onto different targets. However in this architecture the LUN is
spread with an address region size on the order of 10s of MBs, which
means the resulting table could have more than a million entries and
consume far too much memory.
Using this device-mapper switch target we can now build a two-layer
device hierarchy:
Upper Tier - Determine which array member the I/O should be sent to.
Lower Tier - Load balance amongst paths to a particular member.
The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths. We also build a
non-preferred priority group containing paths to other array members for
failover reasons.
The upper tier consists of a single dm-switch device. This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O. By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us). This is a much denser representation than the dm table
b-tree can achieve.
Construction Parameters
=======================
<num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
<num_paths>
The number of paths across which to distribute the I/O.
<region_size>
The number of 512-byte sectors in a region. Each region can be redirected
to any of the available paths.
<num_optional_args>
The number of optional arguments. Currently, no optional arguments
are supported and so this must be zero.
<dev_path>
The block device that represents a specific path to the device.
<offset>
The offset of the start of data on the specific <dev_path> (in units
of 512-byte sectors). This number is added to the sector number when
forwarding the request to the specific path. Typically it is zero.
Messages
========
set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...
Modify the region table by specifying which regions are redirected to
which paths.
<index>
The region number (region size was specified in constructor parameters).
If index is omitted, the next region (previous index + 1) is used.
Expressed in hexadecimal (WITHOUT any prefix like 0x).
<path_nr>
The path number in the range 0 ... (<num_paths> - 1).
Expressed in hexadecimal (WITHOUT any prefix like 0x).
R<n>,<m>
This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
slots.
Status
======
No status line is reported.
Example
=======
Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
the same size.
Create a switch device with 64kB region size::
dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1::
dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
Set repetitive mapping. This command::
dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to::
dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
:1 :2 :1 :2 :1 :2 :1 :2 :1 :2
View File
@@ -0,0 +1,427 @@
=================
Thin provisioning
=================
Introduction
============
This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.
The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume. This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.
Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...). The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth). This new
implementation uses a single data structure to avoid this degradation
with depth. Fragmentation may still be an issue, however, in some
scenarios.
Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:
- Improve metadata resilience by storing metadata on a mirrored volume
but data on a non-mirrored one.
- Improve performance by storing the metadata on SSD.
Status
======
These targets are considered safe for production use. But different use
cases will have different performance characteristics, for example due
to fragmentation of the data volume.
If you find this software is not performing as expected please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.
Userspace tools for checking and repairing the metadata have been fully
developed and are available as 'thin_check' and 'thin_repair'. The name
of the package that provides these utilities varies by distribution (on
a Red Hat distribution it is named 'device-mapper-persistent-data').
Cookbook
========
This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly. End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.
Pool device
-----------
The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:
- Function calls from the thin targets
- Device-mapper 'messages' from userspace which control the creation of new
virtual devices amongst other things.
Setting up a fresh pool device
------------------------------
Setting up a pool device requires a valid metadata device, and a
data device. If you do not have an existing metadata device you can
make one by zeroing the first 4k to indicate empty metadata::

dd if=/dev/zero of=$metadata_dev bs=4096 count=1
The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots). If you have
less sharing than average you'll need a larger-than-average metadata device.
As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it up
to 2MB if the answer is smaller. If you're creating large numbers of
snapshots which are recording large amounts of change, you may find you
need to increase this.
The largest size supported is 16GB: if the device is larger,
a warning will be issued and the excess space will not be used.
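For example, a hedged back-of-the-envelope calculation in shell (all sizes
hypothetical): a 1TiB data device with 64KB data blocks would need roughly
768MB of metadata, which is above the 2MB floor and below the 16GB cap::
data_dev_size=2147483648   # 1TiB data device, in 512-byte sectors
data_block_size=128        # 64KB data blocks, in 512-byte sectors
echo $((48 * data_dev_size / data_block_size))   # prints 805306368 bytes (~768MB)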
Reloading a pool table
----------------------
You may reload a pool's table, indeed this is how the pool is resized
if it runs out of space. (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
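A minimal sketch of such a resize (the new length of 41943040 sectors is
hypothetical; the metadata device, block size and low water mark must stay
the same)::
dmsetup suspend pool
dmsetup reload pool --table "0 41943040 thin-pool $metadata_dev $data_dev \
$data_block_size $low_water_mark"
dmsetup resume pool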
Using an existing pool device
-----------------------------
::
dmsetup create pool \
--table "0 20971520 thin-pool $metadata_dev $data_dev \
$data_block_size $low_water_mark"
$data_block_size gives the smallest unit of disk space that can be
allocated at a time expressed in units of 512-byte sectors.
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
multiple of 128 (64KB). $data_block_size cannot be changed after the
thin-pool is created. People primarily interested in thin provisioning
may want to use a value such as 1024 (512KB). People doing lots of
snapshotting may want a smaller value such as 128 (64KB). If you are
not zeroing newly-allocated data, a larger $data_block_size in the
region of 256000 (128MB) is suggested.
$low_water_mark is expressed in blocks of size $data_block_size. If
free space on the data device drops below this level then a dm event
will be triggered which a userspace daemon should catch allowing it to
extend the pool device. Only one such event will be sent.
No special event is triggered if a just resumed device's free space is below
the low water mark. However, resuming a device always triggers an
event; a userspace daemon should verify that free space exceeds the low
water mark when handling this event.
A low water mark for the metadata device is maintained in the kernel and
will trigger a dm event if free space on the metadata device drops below
it.
Updating on-disk metadata
-------------------------
On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second. This
means the thin-provisioning target behaves like a physical disk that has
a volatile write cache. If power is lost you may lose some recent
writes. The metadata should always be consistent in spite of any crash.
If data space is exhausted the pool will either error or queue IO
according to the configuration (see: error_if_no_space). If metadata
space is exhausted or a metadata operation fails: the pool will error IO
until the pool is taken offline and repair is performed to 1) fix any
potential inconsistencies and 2) clear the flag that imposes repair.
Once the pool's metadata device is repaired it may be resized, which
will allow the pool to return to normal operation. Note that if a pool
is flagged as needing repair, the pool's data and metadata devices
cannot be resized until repair is performed. It should also be noted
that when the pool's metadata space is exhausted the current metadata
transaction is aborted. Given that the pool will cache IO whose
completion may have already been acknowledged to upper IO layers
(e.g. filesystem) it is strongly suggested that consistency checks
(e.g. fsck) be performed on those layers when repair of the pool is
required.
Thin provisioning
-----------------
i) Creating a new thinly-provisioned volume.
To create a new thinly-provisioned volume you must send a message to an
active pool device, /dev/mapper/pool in this example::
dmsetup message /dev/mapper/pool 0 "create_thin 0"
Here '0' is an identifier for the volume, a 24-bit number. It's up
to the caller to allocate and manage these identifiers. If the
identifier is already in use, the message will fail with -EEXIST.
ii) Using a thinly-provisioned volume.
Thinly-provisioned volumes are activated using the 'thin' target::
dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
The last parameter is the identifier for the thinp device.
Internal snapshots
------------------
i) Creating an internal snapshot.
Snapshots are created with another message to the pool.
N.B. If the origin device that you wish to snapshot is active, you
must suspend it before creating the snapshot to avoid corruption.
This is NOT enforced at the moment, so please be careful!
::
dmsetup suspend /dev/mapper/thin
dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
dmsetup resume /dev/mapper/thin
Here '1' is the identifier for the volume, a 24-bit number. '0' is the
identifier for the origin device.
ii) Using an internal snapshot.
Once created, the user doesn't have to worry about any connection
between the origin and the snapshot. Indeed the snapshot is no
different from any other thinly-provisioned device and can be
snapshotted itself via the same method. It's perfectly legal to
have only one of them active, and there's no ordering requirement on
activating or removing them both. (This differs from conventional
device-mapper snapshots.)
Activate it exactly the same way as any other thinly-provisioned volume::
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
External snapshots
------------------
You can use an external **read only** device as an origin for a
thinly-provisioned volume. Any read to an unprovisioned area of the
thin device will be passed through to the origin. Writes trigger
the allocation of new blocks as usual.
One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).
You must not write to the origin device if you use this technique!
Of course, you may write to the thin device and take internal snapshots
of the thin volume.
i) Creating a snapshot of an external device
This is the same as creating a thin device.
You don't mention the origin at this stage.
::
dmsetup message /dev/mapper/pool 0 "create_thin 0"
ii) Using a snapshot of an external device.
Append an extra parameter to the thin target specifying the origin::
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"
N.B. All descendants (internal snapshots) of this snapshot require the
same extra origin parameter.
Deactivation
------------
All devices using a pool must be deactivated before the pool itself
can be.
::
dmsetup remove thin
dmsetup remove snap
dmsetup remove pool
Reference
=========
'thin-pool' target
------------------
i) Constructor
::
thin-pool <metadata dev> <data dev> <data block size (sectors)> \
<low water mark (blocks)> [<number of feature args> [<arg>]*]
Optional feature arguments:
skip_block_zeroing:
Skip the zeroing of newly-provisioned blocks.
ignore_discard:
Disable discard support.
no_discard_passdown:
Don't pass discards down to the underlying
data device, but just remove the mapping.
read_only:
Don't allow any changes to be made to the pool
metadata. This mode is only available after the
thin-pool has been created and first used in full
read/write mode. It cannot be specified on initial
thin-pool creation.
error_if_no_space:
Error IOs, instead of queueing, if no space.
Data block size must be between 64KB (128 sectors) and 1GB
(2097152 sectors) inclusive.
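As an illustration only (device names hypothetical), a pool created with a
single feature argument might look like::
# one feature argument follows the low water mark (32768 blocks)
dmsetup create pool --table "0 20971520 thin-pool /dev/mapper/meta \
/dev/mapper/data 128 32768 1 skip_block_zeroing"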
ii) Status
::
<transaction id> <used metadata blocks>/<total metadata blocks>
<used data blocks>/<total data blocks> <held metadata root>
ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
needs_check|- metadata_low_watermark
transaction id:
A 64-bit number used by userspace to help synchronise with metadata
from volume managers.
used data blocks / total data blocks
If the number of free blocks drops below the pool's low water mark a
dm event will be sent to userspace. This event is edge-triggered and
it will occur only once after each resume so volume manager writers
should register for the event and then check the target's status.
held metadata root:
The location, in blocks, of the metadata root that has been
'held' for userspace read access. '-' indicates there is no
held root.
discard_passdown|no_discard_passdown
Whether or not discards are actually being passed down to the
underlying device. Even when enabled at table load time, it can
get disabled if the underlying device doesn't support it.
ro|rw|out_of_data_space
If the pool encounters certain types of device failures it will
drop into a read-only metadata mode in which no changes to
the pool metadata (like allocating new blocks) are permitted.
In serious cases where even a read-only mode is deemed unsafe
no further I/O will be permitted and the status will just
contain the string 'Fail'. The userspace recovery tools
should then be used.
error_if_no_space|queue_if_no_space
If the pool runs out of data or metadata space, the pool will
either queue or error the IO destined to the data device. The
default is to queue the IO until more space is added or the
'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool
module parameter can be used to change this timeout -- it
defaults to 60 seconds but may be disabled using a value of 0.
needs_check
A metadata operation has failed, resulting in the needs_check
flag being set in the metadata's superblock. The metadata
device must be deactivated and checked/repaired before the
thin-pool can be made fully operational again. '-' indicates
needs_check is not set.
metadata_low_watermark:
Value of metadata low watermark in blocks. The kernel sets this
value internally but userspace needs to know this value to
determine if an event was caused by crossing this threshold.
iii) Messages
create_thin <dev id>
Create a new thinly-provisioned device.
<dev id> is an arbitrary unique 24-bit identifier chosen by
the caller.
create_snap <dev id> <origin id>
Create a new snapshot of another thinly-provisioned device.
<dev id> is an arbitrary unique 24-bit identifier chosen by
the caller.
<origin id> is the identifier of the thinly-provisioned device
of which the new device will be a snapshot.
delete <dev id>
Deletes a thin device. Irreversible.
set_transaction_id <current id> <new id>
Userland volume managers, such as LVM, need a way to
synchronise their external metadata with the internal metadata of the
pool target. The thin-pool target offers to store an
arbitrary 64-bit transaction id and return it on the target's
status line. To avoid races you must provide what you think
the current transaction id is when you change it with this
compare-and-swap message.
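For instance, moving the transaction id from 1 to 2 (values hypothetical)::
dmsetup message /dev/mapper/pool 0 "set_transaction_id 1 2"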
reserve_metadata_snap
Reserve a copy of the data mapping btree for use by userland.
This allows userland to inspect the mappings as they were when
this message was executed. Use the pool's status command to
get the root block associated with the metadata snapshot.
release_metadata_snap
Release a previously reserved copy of the data mapping btree.
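A sketch of the reserve/inspect/release cycle::
dmsetup message /dev/mapper/pool 0 reserve_metadata_snap
dmsetup status /dev/mapper/pool    # held metadata root appears in the status
dmsetup message /dev/mapper/pool 0 release_metadata_snap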
'thin' target
-------------
i) Constructor
::
thin <pool dev> <dev id> [<external origin dev>]
pool dev:
the thin-pool device, e.g. /dev/mapper/my_pool or 253:0
dev id:
the internal device identifier of the device to be
activated.
external origin dev:
an optional block device outside the pool to be treated as a
read-only snapshot origin: reads to unprovisioned areas of the
thin target will be mapped to this device.
The pool doesn't store any size against the thin devices. If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end. If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.
ii) Status
<nr mapped sectors> <highest mapped sector>
If the pool has encountered device errors and failed, the status
will just contain the string 'Fail'. The userspace recovery
tools should then be used.
In the case where <nr mapped sectors> is 0, there is no highest
mapped sector and the value of <highest mapped sector> is unspecified.
View File
@@ -0,0 +1,135 @@
================================
Device-mapper "unstriped" target
================================
Introduction
============
The device-mapper "unstriped" target provides a transparent mechanism to
unstripe a device-mapper "striped" target to access the underlying disks
without having to touch the true backing block-device. It can also be
used to unstripe a hardware RAID-0 to access backing disks.
Parameters:
<number of stripes> <chunk size> <stripe #> <dev_path> <offset>
<number of stripes>
The number of stripes in the RAID 0.
<chunk size>
The amount of 512B sectors in the chunk striping.
<stripe #>
The stripe number within the device that corresponds to the physical
drive you wish to unstripe. This must be 0 indexed.
<dev_path>
The block device you wish to unstripe.
<offset>
The starting sector within the device.
Why use this module?
====================
An example of undoing an existing dm-stripe
-------------------------------------------
This small bash script will set up 4 loop devices and use the existing
striped target to combine the 4 devices into one. It will then use
the unstriped target on top of the striped device to access the
individual backing loop devices. We write data to the newly exposed
unstriped devices and verify that the data written matches the correct
underlying device on the striped array::
#!/bin/bash
MEMBER_SIZE=$((128 * 1024 * 1024))
NUM=4
SEQ_END=$((${NUM}-1))
CHUNK=256
BS=4096
RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
COUNT=$((${MEMBER_SIZE} / ${BS}))
for i in $(seq 0 ${SEQ_END}); do
dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
losetup /dev/loop${i} member-${i}
DM_PARMS+=" /dev/loop${i} 0"
done
echo $DM_PARMS | dmsetup create raid0
for i in $(seq 0 ${SEQ_END}); do
echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
done;
for i in $(seq 0 ${SEQ_END}); do
dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
diff /dev/mapper/set-${i} member-${i}
done;
for i in $(seq 0 ${SEQ_END}); do
dmsetup remove set-${i}
done
dmsetup remove raid0
for i in $(seq 0 ${SEQ_END}); do
losetup -d /dev/loop${i}
rm -f member-${i}
done
Another example
---------------
Intel NVMe drives contain two cores on the physical device.
Each core of the drive has segregated access to its LBA range.
The current LBA model has a RAID 0 128k chunk on each core, resulting
in a 256k stripe across the two cores::
Core 0: Core 1:
__________ __________
| LBA 512| | LBA 768|
| LBA 0 | | LBA 256|
---------- ----------
The purpose of this unstriping is to provide better QoS in noisy
neighbor environments. When two partitions are created on the
aggregate drive without this unstriping, reads on one partition
can affect writes on another partition. This is because the partitions
are striped across the two cores. When we unstripe this hardware RAID 0
and make partitions on each new exposed device the two partitions are now
physically separated.
With the dm-unstriped target we're able to segregate an fio script that
has read and write jobs that are independent of each other. Compared to
when we run the test on a combined drive with partitions, we were able
to get a 92% reduction in read latency using this device mapper target.
Example dmsetup usage
=====================
unstriped on top of an Intel NVMe device that has 2 cores
---------------------------------------------------------
::
dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
There will now be two devices that expose Intel NVMe core 0 and 1
respectively::
/dev/mapper/nvmset0
/dev/mapper/nvmset1
unstriped on top of a striped device with 4 drives using a 128K chunk size
--------------------------------------------------------------------------
::
dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
View File
@@ -0,0 +1,229 @@
=========
dm-verity
=========
Device-Mapper's "verity" target provides transparent integrity checking of
block devices using a cryptographic digest provided by the kernel crypto API.
This target is read-only.
Construction Parameters
=======================
::
<version> <dev> <hash_dev>
<data_block_size> <hash_block_size>
<num_data_blocks> <hash_start_block>
<algorithm> <digest> <salt>
[<#opt_params> <opt_params>]
<version>
This is the type of the on-disk hash format.
0 is the original format used in the Chromium OS.
The salt is appended when hashing, digests are stored continuously and
the rest of the block is padded with zeroes.
1 is the current format that should be used for new devices.
The salt is prepended when hashing and each digest is
padded with zeroes to a power-of-two boundary.
<dev>
This is the device containing data, the integrity of which needs to be
checked. It may be specified as a path, like /dev/sdaX, or a device number,
<major>:<minor>.
<hash_dev>
This is the device that supplies the hash tree data. It may be
specified similarly to the device path and may be the same device. If the
same device is used, the hash_start should be outside the configured
dm-verity device.
<data_block_size>
The block size on a data device in bytes.
Each block corresponds to one digest on the hash device.
<hash_block_size>
The size of a hash block in bytes.
<num_data_blocks>
The number of data blocks on the data device. Additional blocks are
inaccessible. You can place the hashes on the same partition as the data,
in which case the hashes are placed after <num_data_blocks>.
<hash_start_block>
This is the offset, in <hash_block_size>-blocks, from the start of hash_dev
to the root block of the hash tree.
<algorithm>
The cryptographic hash algorithm used for this device. This should
be the name of the algorithm, like "sha1".
<digest>
The hexadecimal encoding of the cryptographic hash of the root hash block
and the salt. This hash should be trusted as there is no other authenticity
beyond this point.
<salt>
The hexadecimal encoding of the salt value.
<#opt_params>
Number of optional parameters. If there are no optional parameters,
the optional parameters section can be skipped or #opt_params can be zero.
Otherwise #opt_params is the number of following arguments.
Example of optional parameters section:
1 ignore_corruption
ignore_corruption
Log corrupted blocks, but allow read operations to proceed normally.
restart_on_corruption
Restart the system when a corrupted block is discovered. This option is
not compatible with ignore_corruption and requires user space support to
avoid restart loops.
ignore_zero_blocks
Do not verify blocks that are expected to contain zeroes and always return
zeroes instead. This may be useful if the partition contains unused blocks
that are not guaranteed to contain zeroes.
use_fec_from_device <fec_dev>
Use forward error correction (FEC) to recover from corruption if hash
verification fails. Use encoding data from the specified device. This
may be the same device where data and hash blocks reside, in which case
fec_start must be outside data and hash areas.
If the encoding data covers additional metadata, it must be accessible
on the hash device after the hash blocks.
Note: block sizes for data and hash devices must match. Also, if the
verity <dev> is encrypted the <fec_dev> should be too.
fec_roots <num>
Number of generator roots. This equals the number of parity bytes in
the encoding data. For example, in RS(M, N) encoding, the number of roots
is M-N.
fec_blocks <num>
The number of encoding data blocks on the FEC device. The block size for
the FEC device is <data_block_size>.
fec_start <offset>
This is the offset, in <data_block_size> blocks, from the start of the
FEC device to the beginning of the encoding data.
check_at_most_once
Verify data blocks only the first time they are read from the data device,
rather than every time. This reduces the overhead of dm-verity so that it
can be used on systems that are memory and/or CPU constrained. However, it
provides a reduced level of security because only offline tampering of the
data device's content will be detected, not online tampering.
Hash blocks are still verified each time they are read from the hash device,
since verification of hash blocks is less performance critical than data
blocks, and a hash block will not be verified any more after all the data
blocks it covers have been verified anyway.
Theory of operation
===================
dm-verity is meant to be set up as part of a verified boot path. This
may be anything ranging from a boot using tboot or trustedgrub to just
booting from a known-good device (like a USB drive or CD).
When a dm-verity device is configured, it is expected that the caller
has been authenticated in some way (cryptographic signatures, etc).
After instantiation, all hashes will be verified on-demand during
disk access. If they cannot be verified up to the root node of the
tree, the root hash, then the I/O will fail. This should detect
tampering with any data on the device and the hash data.
Cryptographic hashes are used to assert the integrity of the device on a
per-block basis. This allows for a lightweight hash computation on first read
into the page cache. Block hashes are stored linearly, aligned to the nearest
block size.
If forward error correction (FEC) support is enabled any recovery of
corrupted data will be verified using the cryptographic hash of the
corresponding data. This is why combining error correction with
integrity checking is essential.
Hash Tree
---------
Each node in the tree is a cryptographic hash. If it is a leaf node, the hash
of some data block on disk is calculated. If it is an intermediary node,
the hash of a number of child nodes is calculated.
Each entry in the tree is a collection of neighboring nodes that fit in one
block. The number is determined based on block_size and the size of the
selected cryptographic digest algorithm. The hashes are linearly-ordered in
this entry and any unaligned trailing space is ignored but included when
calculating the parent node.
The tree looks something like:
alg = sha256, num_blocks = 32768, block_size = 4096
::
[ root ]
/ . . . \
[entry_0] [entry_1]
/ . . . \ . . . \
[entry_0_0] . . . [entry_0_127] . . . . [entry_1_127]
/ ... \ / . . . \ / \
blk_0 ... blk_127 blk_16256 blk_16383 blk_32640 . . . blk_32767
On-disk format
==============
The verity kernel code does not read the verity metadata on-disk header.
It only reads the hash blocks which directly follow the header.
It is expected that a user-space tool will verify the integrity of the
verity header.
Alternatively, the header can be omitted and the dmsetup parameters can
be passed via the kernel command-line in a rooted chain of trust where
the command-line is verified.
Directly following the header (and with sector number padded to the next hash
block boundary) are the hash blocks which are stored a depth at a time
(starting from the root), sorted in order of increasing index.
The full specification of kernel parameters and on-disk metadata format
is available at the cryptsetup project's wiki page
https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity
Status
======
V (for Valid) is returned if every check performed so far was valid.
If any check failed, C (for Corruption) is returned.
Example
=======
Set up a device::
# dmsetup create vroot --readonly --table \
"0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\
"4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 "\
"1234000000000000000000000000000000000000000000000000000000000000"
A command line tool veritysetup is available to compute or verify
the hash tree or activate the kernel device. This is available from
the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/
(as a libcryptsetup extension).
Create hash on the device::
# veritysetup format /dev/sda1 /dev/sda2
...
Root hash: 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
Activate the device::
# veritysetup create vroot /dev/sda1 /dev/sda2 \
4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
View File
@@ -0,0 +1,79 @@
=================
Writecache target
=================
The writecache target caches writes on persistent memory or on SSD. It
doesn't cache reads because reads are supposed to be cached in page cache
in normal RAM.
When the device is constructed, the first sector should be zeroed or the
first sector should contain a valid superblock from a previous invocation.
Constructor parameters:
1. type of the cache device - "p" or "s"
- p - persistent memory
- s - SSD
2. the underlying device that will be cached
3. the cache device
4. block size (4096 is recommended; the maximum block size is the page
size)
5. the number of optional parameters (the parameters with an argument
count as two)
start_sector n (default: 0)
offset from the start of cache device in 512-byte sectors
high_watermark n (default: 50)
start writeback when the number of used blocks reaches this
watermark
low_watermark x (default: 45)
stop writeback when the number of used blocks drops below
this watermark
writeback_jobs n (default: unlimited)
limit the number of blocks that are in flight during
writeback. Setting this value reduces writeback
throughput, but it may improve latency of read requests
autocommit_blocks n (default: 64 for pmem, 65536 for ssd)
when the application writes this number of blocks without
issuing the FLUSH request, the blocks are automatically
committed
autocommit_time ms (default: 1000)
autocommit time in milliseconds. The data is automatically
committed if this time passes and no FLUSH request is
received
fua (by default on)
applicable only to persistent memory - use the FUA flag
when writing data from persistent memory back to the
underlying device
nofua
applicable only to persistent memory - don't use the FUA
flag when writing back data and send the FLUSH request
afterwards
- some underlying devices perform better with fua, some
with nofua. The user should test it
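As a sketch only (device names hypothetical), an SSD-backed writecache
with a 4096-byte block size and two optional parameters, each of which
takes an argument and therefore counts as two::
dmsetup create wc --table "0 $(blockdev --getsz /dev/mapper/origin) \
writecache s /dev/mapper/origin /dev/mapper/cache 4096 \
4 high_watermark 60 low_watermark 40"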
Status:
1. error indicator - 0 if there was no error, otherwise error number
2. the number of blocks
3. the number of free blocks
4. the number of blocks under writeback
Messages:
flush
flush the cache device. The message returns successfully
if the cache device was flushed without an error
flush_on_suspend
flush the cache device on next suspend. Use this message
when you are going to remove the cache device. The proper
sequence for removing the cache device is:
1. send the "flush_on_suspend" message
2. load an inactive table with a linear target that maps
to the underlying device
3. suspend the device
4. ask for status and verify that there are no errors
5. resume the device, so that it will use the linear
target
6. the cache device is now inactive and it can be deleted
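A hedged sketch of that sequence ("wc" and the origin device name are
hypothetical)::
dmsetup message wc 0 flush_on_suspend
dmsetup reload wc --table "0 $(blockdev --getsz /dev/mapper/origin) \
linear /dev/mapper/origin 0"
dmsetup suspend wc
dmsetup status wc    # the first status field should be 0 (no error)
dmsetup resume wc    # the device now uses the linear target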
View File
@@ -0,0 +1,37 @@
=======
dm-zero
=======
Device-Mapper's "zero" target provides a block-device that always returns
zero'd data on reads and silently drops writes. This is similar behavior to
/dev/zero, but as a block-device instead of a character-device.
Dm-zero has no target-specific parameters.
One very interesting use of dm-zero is for creating "sparse" devices in
conjunction with dm-snapshot. A sparse device reports a device-size larger
than the amount of actual storage space available for that device. A user can
write data anywhere within the sparse device and read it back like a normal
device. Reads to previously unwritten areas will return a zero'd buffer. When
enough data has been written to fill up the actual storage space, the sparse
device is deactivated. This can be very useful for testing device and
filesystem limitations.
To create a sparse device, start by creating a dm-zero device that's the
desired size of the sparse device. For this example, we'll assume a 10TB
sparse device::
TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors
echo "0 $TEN_TERABYTES zero" | dmsetup create zero1
Then create a snapshot of the zero device, using any available block-device as
the COW device. The size of the COW device will determine the amount of real
space available to the sparse device. For this example, we'll assume /dev/sdb1
is an available 10GB partition::
echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \
dmsetup create sparse1
This will create a 10TB sparse device called /dev/mapper/sparse1 that has
10GB of actual storage space available. If more than 10GB of data is written
to this device, it will start returning I/O errors.
View File
@@ -2693,8 +2693,8 @@
41 = /dev/ttySMX0 Motorola i.MX - port 0
42 = /dev/ttySMX1 Motorola i.MX - port 1
43 = /dev/ttySMX2 Motorola i.MX - port 2
44 = /dev/ttyMM0 Marvell MPSC - port 0 (obsolete unused)
45 = /dev/ttyMM1 Marvell MPSC - port 1 (obsolete unused)
46 = /dev/ttyCPM0 PPC CPM (SCC or SMC) - port 0
...
47 = /dev/ttyCPM5 PPC CPM (SCC or SMC) - port 5
View File
@@ -0,0 +1,100 @@
=================
The EFI Boot Stub
=================
On the x86 and ARM platforms, a kernel zImage/bzImage can masquerade
as a PE/COFF image, thereby convincing EFI firmware loaders to load
it as an EFI executable. The code that modifies the bzImage header,
along with the EFI-specific entry point that the firmware loader
jumps to, are collectively known as the "EFI boot stub", and live in
arch/x86/boot/header.S and arch/x86/boot/compressed/eboot.c,
respectively. For ARM the EFI stub is implemented in
arch/arm/boot/compressed/efi-header.S and
arch/arm/boot/compressed/efi-stub.c. EFI stub code that is shared
between architectures is in drivers/firmware/efi/libstub.
For arm64, there is no compressed kernel support, so the Image itself
masquerades as a PE/COFF image and the EFI stub is linked into the
kernel. The arm64 EFI stub lives in arch/arm64/kernel/efi-entry.S
and drivers/firmware/efi/libstub/arm64-stub.c.
By using the EFI boot stub it's possible to boot a Linux kernel
without the use of a conventional EFI boot loader, such as grub or
elilo. Since the EFI boot stub performs the jobs of a boot loader, in
a certain sense it *IS* the boot loader.
The EFI boot stub is enabled with the CONFIG_EFI_STUB kernel option.
How to install bzImage.efi
--------------------------
The bzImage located in arch/x86/boot/bzImage must be copied to the EFI
System Partition (ESP) and renamed with the extension ".efi". Without
the extension the EFI firmware loader will refuse to execute it. It's
not possible to execute bzImage.efi from the usual Linux file systems
because EFI firmware doesn't have support for them. For ARM the
arch/arm/boot/zImage should be copied to the system partition, and it
may not need to be renamed. Similarly for arm64, arch/arm64/boot/Image
should be copied but not necessarily renamed.
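A minimal sketch for x86 (the mount point and ESP partition are
assumptions; adjust for your system)::
mount /dev/sda1 /mnt/esp    # assuming /dev/sda1 is the ESP
mkdir -p /mnt/esp/EFI/Linux
cp arch/x86/boot/bzImage /mnt/esp/EFI/Linux/bzImage.efi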
Passing kernel parameters from the EFI shell
--------------------------------------------
Arguments to the kernel can be passed after bzImage.efi, e.g.::
fs0:> bzImage.efi console=ttyS0 root=/dev/sda4
The "initrd=" option
--------------------
Like most boot loaders, the EFI stub allows the user to specify
multiple initrd files using the "initrd=" option. This is the only EFI
stub-specific command line parameter, everything else is passed to the
kernel when it boots.
The path to the initrd file must be an absolute path from the
beginning of the ESP, relative path names do not work. Also, the path
is an EFI-style path and directory elements must be separated with
backslashes (\). For example, given the following directory layout::
fs0:>
Kernels\
bzImage.efi
initrd-large.img
Ramdisks\
initrd-small.img
initrd-medium.img
to boot with the initrd-large.img file if the current working
directory is fs0:\Kernels, the following command must be used::
fs0:\Kernels> bzImage.efi initrd=\Kernels\initrd-large.img
Notice how bzImage.efi can be specified with a relative path. That's
because the image we're executing is interpreted by the EFI shell,
which understands relative paths, whereas the rest of the command line
is passed to bzImage.efi.
The "dtb=" option
-----------------
For the ARM and arm64 architectures, a device tree must be provided to
the kernel. Normally firmware shall supply the device tree via the
EFI CONFIGURATION TABLE. However, the "dtb=" command line option can
be used to override the firmware supplied device tree, or to supply
one when firmware is unable to.
Please note: Firmware adds runtime configuration information to the
device tree before booting the kernel. If dtb= is used to override
the device tree, then any runtime data provided by firmware will be
lost. The dtb= option should only be used either as a debug tool, or
as a last resort when a device tree is not provided in the EFI
CONFIGURATION TABLE.
"dtb=" is processed in the same manner as the "initrd=" option that is
described above.
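For example, booting an ARM kernel from the EFI shell with both options
(paths hypothetical, following the same ESP path rules as "initrd=")::
fs0:\Kernels> zImage.efi dtb=\Dtbs\board.dtb initrd=\Ramdisks\initrd-small.img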
View File
@@ -0,0 +1,17 @@
.. SPDX-License-Identifier: GPL-2.0
====
gpio
====
.. toctree::
:maxdepth: 1
sysfs
.. only:: subproject and html
Indices
=======
* :ref:`genindex`
View File
@@ -0,0 +1,167 @@
GPIO Sysfs Interface for Userspace
==================================
.. warning::
THIS ABI IS DEPRECATED, THE ABI DOCUMENTATION HAS BEEN MOVED TO
Documentation/ABI/obsolete/sysfs-gpio AND NEW USERSPACE CONSUMERS
ARE SUPPOSED TO USE THE CHARACTER DEVICE ABI. THIS OLD SYSFS ABI WILL
NOT BE DEVELOPED (NO NEW FEATURES), IT WILL JUST BE MAINTAINED.
Refer to the examples in tools/gpio/* for an introduction to the new
character device ABI. Also see the userspace header in
include/uapi/linux/gpio.h
The deprecated sysfs ABI
------------------------
Platforms which use the "gpiolib" implementors framework may choose to
configure a sysfs user interface to GPIOs. This is different from the
debugfs interface, since it provides control over GPIO direction and
value instead of just showing a gpio state summary. Plus, it could be
present on production systems without debugging support.
Given appropriate hardware documentation for the system, userspace could
know for example that GPIO #23 controls the write protect line used to
protect boot loader segments in flash memory. System upgrade procedures
may need to temporarily remove that protection, first importing a GPIO,
then changing its output state, then updating the code before re-enabling
the write protection. In normal use, GPIO #23 would never be touched,
and the kernel would have no need to know about it.
Again depending on appropriate hardware documentation, on some systems
userspace GPIO can be used to determine system configuration data that
standard kernels won't know about. And for some tasks, simple userspace
GPIO drivers could be all that the system really needs.
DO NOT ABUSE SYSFS TO CONTROL HARDWARE THAT HAS PROPER KERNEL DRIVERS.
PLEASE READ THE DOCUMENT AT Documentation/driver-api/gpio/drivers-on-gpio.rst
TO AVOID REINVENTING KERNEL WHEELS IN USERSPACE. I MEAN IT. REALLY.
Paths in Sysfs
--------------
There are three kinds of entries in /sys/class/gpio:
- Control interfaces used to get userspace control over GPIOs;
- GPIOs themselves; and
- GPIO controllers ("gpio_chip" instances).
That's in addition to standard files including the "device" symlink.
The control interfaces are write-only:
/sys/class/gpio/
"export" ...
Userspace may ask the kernel to export control of
a GPIO to userspace by writing its number to this file.
Example: "echo 19 > export" will create a "gpio19" node
for GPIO #19, if that's not requested by kernel code.
"unexport" ...
Reverses the effect of exporting to userspace.
Example: "echo 19 > unexport" will remove a "gpio19"
node exported using the "export" file.
GPIO signals have paths like /sys/class/gpio/gpio42/ (for GPIO #42)
and have the following read/write attributes:
/sys/class/gpio/gpioN/
"direction" ...
reads as either "in" or "out". This value may
normally be written. Writing as "out" defaults to
initializing the value as low. To ensure glitch free
operation, values "low" and "high" may be written to
configure the GPIO as an output with that initial value.
Note that this attribute *will not exist* if the kernel
doesn't support changing the direction of a GPIO, or
it was exported by kernel code that didn't explicitly
allow userspace to reconfigure this GPIO's direction.
"value" ...
reads as either 0 (low) or 1 (high). If the GPIO
is configured as an output, this value may be written;
any nonzero value is treated as high.
If the pin can be configured as an interrupt-generating pin and
if it has been configured to generate interrupts (see the
description of "edge"), you can poll(2) on that file and
poll(2) will return whenever the interrupt was triggered. If
you use poll(2), set the events POLLPRI and POLLERR. If you
use select(2), set the file descriptor in exceptfds. After
poll(2) returns, either lseek(2) to the beginning of the sysfs
file and read the new value or close the file and re-open it
to read the value.
"edge" ...
reads as either "none", "rising", "falling", or
"both". Write these strings to select the signal edge(s)
that will make poll(2) on the "value" file return.
This file exists only if the pin can be configured as an
interrupt generating input pin.
"active_low" ...
reads as either 0 (false) or 1 (true). Write
any nonzero value to invert the value attribute both
for reading and writing. Edge-triggered poll(2)
support configured via the edge attribute for
"rising" and "falling" edges will follow this
setting.
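The attributes above can be exercised directly from the shell; a minimal
sketch using the deprecated ABI (GPIO #23 is hypothetical)::
echo 23 > /sys/class/gpio/export
echo in > /sys/class/gpio/gpio23/direction
echo rising > /sys/class/gpio/gpio23/edge    # only if the pin supports interrupts
cat /sys/class/gpio/gpio23/value
echo 23 > /sys/class/gpio/unexport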
GPIO controllers have paths like /sys/class/gpio/gpiochip42/ (for the
controller implementing GPIOs starting at #42) and have the following
read-only attributes:
/sys/class/gpio/gpiochipN/
"base" ...
same as N, the first GPIO managed by this chip
"label" ...
provided for diagnostics (not always unique)
"ngpio" ...
how many GPIOs this manages (N to N + ngpio - 1)
Board documentation should in most cases cover what GPIOs are used for
what purposes. However, those numbers are not always stable; GPIOs on
a daughtercard might be different depending on the base board being used,
or other cards in the stack. In such cases, you may need to use the
gpiochip nodes (possibly in conjunction with schematics) to determine
the correct GPIO number to use for a given signal.
Exporting from Kernel code
--------------------------
Kernel code can explicitly manage exports of GPIOs which have already been
requested using gpio_request()::
/* export the GPIO to userspace */
int gpiod_export(struct gpio_desc *desc, bool direction_may_change);
/* reverse gpio_export() */
void gpiod_unexport(struct gpio_desc *desc);
/* create a sysfs link to an exported GPIO node */
int gpiod_export_link(struct device *dev, const char *name,
struct gpio_desc *desc);
After a kernel driver requests a GPIO, it may only be made available in
the sysfs interface by gpiod_export(). The driver can control whether the
signal direction may change. This helps drivers prevent userspace code
from accidentally clobbering important system state.
This explicit exporting can help with debugging (by making some kinds
of experiments easier), or can provide an always-there interface that's
suitable for documenting as part of a board support package.
After the GPIO has been exported, gpiod_export_link() allows creating
symlinks from elsewhere in sysfs to the GPIO sysfs node. Drivers can
use this to provide the interface under their own device in sysfs with
a descriptive name.
View File
@@ -0,0 +1,80 @@
===================================================
Notes on the change from 16-bit UIDs to 32-bit UIDs
===================================================
:Author: Chris Wing <wingc@umich.edu>
:Last updated: January 11, 2000
- kernel code MUST take into account __kernel_uid_t and __kernel_uid32_t
when communicating between user and kernel space in an ioctl or data
structure.
- kernel code should use uid_t and gid_t in kernel-private structures and
code.
What's left to be done for 32-bit UIDs on all Linux architectures:
- Disk quotas have an interesting limitation that is not related to the
maximum UID/GID. They are limited by the maximum file size on the
underlying filesystem, because quota records are written at offsets
corresponding to the UID in question.
Further investigation is needed to see if the quota system can cope
properly with huge UIDs. If it can deal with 64-bit file offsets on all
architectures, this should not be a problem.
- Decide whether or not to keep backwards compatibility with the system
accounting file, or if we should break it as the comments suggest
(currently, the old 16-bit UID and GID are still written to disk, and
part of the former pad space is used to store separate 32-bit UID and
GID)
- Need to validate that OS emulation calls the 16-bit UID
compatibility syscalls, if the OS being emulated used 16-bit UIDs, or
uses the 32-bit UID system calls properly otherwise.
This affects at least:
- iBCS on Intel
- sparc32 emulation on sparc64
(need to support whatever new 32-bit UID system calls are added to
sparc32)
- Validate that all filesystems behave properly.
At present, 32-bit UIDs _should_ work for:
- ext2
- ufs
- isofs
- nfs
- coda
- udf
Ioctl() fixups have been made for:
- ncpfs
- smbfs
Filesystems with simple fixups to prevent 16-bit UID wraparound:
- minix
- sysv
- qnx4
Other filesystems have not been checked yet.
The ncpfs and smbfs filesystems cannot presently use 32-bit UIDs in
all ioctl()s. Some new ioctl()s have been added with 32-bit UIDs, but
more are needed. (as well as new user<->kernel data structures)
- The ELF core dump format only supports 16-bit UIDs on arm, i386, m68k,
sh, and sparc32. Fixing this is probably not that important, but would
require adding a new ELF section.
- The ioctl()s used to control the in-kernel NFS server only support
16-bit UIDs on arm, i386, m68k, sh, and sparc32.
- make sure that the UID mapping feature of AX25 networking works properly
(it should be safe because it has always used a 32-bit integer to
communicate between user and kernel)
View File
@@ -9,5 +9,6 @@ are configurable at compile, boot or run time.
.. toctree::
:maxdepth: 1
spectre
l1tf
mds
View File
@@ -241,7 +241,7 @@ Guest mitigation mechanisms
For further information about confining guests to a single or to a group
of cores consult the cpusets documentation:
https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v1/cpusets.rst
.. _interrupt_isolation:
View File
@@ -0,0 +1,697 @@
.. SPDX-License-Identifier: GPL-2.0
Spectre Side Channels
=====================
Spectre is a class of side channel attacks that exploit branch prediction
and speculative execution on modern CPUs to read memory, possibly
bypassing access controls. Speculative execution side channel exploits
do not modify memory but attempt to infer privileged data in the memory.
This document covers Spectre variant 1 and Spectre variant 2.
Affected processors
-------------------
Speculative execution side channel methods affect a wide range of modern
high performance processors, since most modern high speed processors
use branch prediction and speculative execution.
The following CPUs are vulnerable:
- Intel Core, Atom, Pentium, and Xeon processors
- AMD Phenom, EPYC, and Zen processors
- IBM POWER and zSeries processors
- Higher end ARM processors
- Apple CPUs
- Higher end MIPS CPUs
- Likely most other high performance CPUs. Contact your CPU vendor for details.
Whether a processor is affected or not can be read out from the Spectre
vulnerability files in sysfs. See :ref:`spectre_sys_info`.
Related CVEs
------------
The following CVE entries describe Spectre variants:
============= ======================= =================
CVE-2017-5753 Bounds check bypass Spectre variant 1
CVE-2017-5715 Branch target injection Spectre variant 2
============= ======================= =================
Problem
-------
CPUs use speculative operations to improve performance. That may leave
traces of memory accesses or computations in the processor's caches,
buffers, and branch predictors. Malicious software may be able to
influence the speculative execution paths, and then use the side effects
of the speculative execution in the CPUs' caches and buffers to infer
privileged data touched during the speculative execution.
Spectre variant 1 attacks take advantage of speculative execution of
conditional branches, while Spectre variant 2 attacks use speculative
execution of indirect branches to leak privileged memory.
See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[7] <spec_ref7>`
:ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`.
Spectre variant 1 (Bounds Check Bypass)
---------------------------------------
The bounds check bypass attack :ref:`[2] <spec_ref2>` takes advantage
of speculative execution that bypasses conditional branch instructions
used for memory access bounds check (e.g. checking if the index of an
array results in memory access within a valid range). This results in
memory accesses to invalid memory (with out-of-bound index) that are
done speculatively before validation checks resolve. Such speculative
memory accesses can leave side effects, creating side channels which
leak information to the attacker.
There are some extensions of Spectre variant 1 attacks for reading data
over the network, see :ref:`[12] <spec_ref12>`. However such attacks
are difficult, low bandwidth, fragile, and are considered low risk.
Spectre variant 2 (Branch Target Injection)
-------------------------------------------
The branch target injection attack takes advantage of speculative
execution of indirect branches :ref:`[3] <spec_ref3>`. The indirect
branch predictors inside the processor used to guess the target of
indirect branches can be influenced by an attacker, causing gadget code
to be speculatively executed, thus exposing sensitive data touched by
the victim. The side effects left in the CPU's caches during speculative
execution can be measured to infer data values.
.. _poison_btb:
In Spectre variant 2 attacks, the attacker can steer speculative indirect
branches in the victim to gadget code by poisoning the branch target
buffer of a CPU used for predicting indirect branch addresses. Such
poisoning could be done by indirect branching into existing code,
with the address offset of the indirect branch under the attacker's
control. Since the branch prediction on impacted hardware does not
fully disambiguate branch address and uses the offset for prediction,
this could cause privileged code's indirect branch to jump to gadget
code with the same offset.
The most useful gadgets take an attacker-controlled input parameter (such
as a register value) so that the memory read can be controlled. Gadgets
without input parameters might be possible, but the attacker would have
very little control over what memory can be read, reducing the risk of
the attack revealing useful data.
One other variant 2 attack vector is for the attacker to poison the
return stack buffer (RSB) :ref:`[13] <spec_ref13>` to cause speculative
subroutine return instruction execution to go to a gadget. An attacker's
imbalanced subroutine call instructions might "poison" entries in the
return stack buffer which are later consumed by a victim's subroutine
return instructions. This attack can be mitigated by flushing the return
stack buffer on context switch, or virtual machine (VM) exit.
On systems with simultaneous multi-threading (SMT), attacks are possible
from the sibling thread, as level 1 cache and branch target buffer
(BTB) may be shared between hardware threads in a CPU core. A malicious
program running on the sibling thread may influence its peer's BTB to
steer its indirect branch speculations to gadget code, and measure the
speculative execution's side effects left in level 1 cache to infer the
victim's data.
Attack scenarios
----------------
The following list of attack scenarios has been anticipated, but may
not cover all possible attack vectors.
1. A user process attacking the kernel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The attacker passes a parameter to the kernel via a register or
via a known address in memory during a syscall. Such a parameter may
be used later by the kernel as an index to an array or to derive
a pointer for a Spectre variant 1 attack. The index or pointer
is invalid, but bounds checks are bypassed in the code branch taken
for speculative execution. This could cause privileged memory to be
accessed and leaked.
For kernel code where data pointers have been identified as potentially
under attacker influence for Spectre attacks, new "nospec" accessor
macros are used to prevent speculative loading of data.
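A minimal sketch of that pattern (the surrounding names and the
``-EINVAL`` path are hypothetical; ``array_index_nospec()`` is the
helper from ``include/linux/nospec.h``)::

    #include <linux/nospec.h>

    if (index >= bound)
        return -EINVAL;
    /* Clamp 'index' so it cannot exceed 'bound' even under
     * speculative execution, then perform the array access.
     */
    index = array_index_nospec(index, bound);
    value = table[index];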
A Spectre variant 2 attacker can :ref:`poison <poison_btb>` the branch
target buffer (BTB) before issuing a syscall to launch an attack.
After entering the kernel, the kernel could use the poisoned branch
target buffer on indirect jump and jump to gadget code in speculative
execution.
If an attacker tries to control the memory addresses leaked during
speculative execution, the attacker also needs to pass a parameter
to the gadget, either through a register or a known address in
memory. After the gadget has executed, the attacker can measure the
side effects.
The kernel can protect itself against consuming poisoned branch
target buffer entries by using return trampolines (also known as
"retpoline") :ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` for all
indirect branches. Return trampolines trap speculative execution paths
to prevent jumping to gadget code during speculative execution.
x86 CPUs with Enhanced Indirect Branch Restricted Speculation
(Enhanced IBRS) available in hardware should use the feature to
mitigate Spectre variant 2 instead of retpoline. Enhanced IBRS is
more efficient than retpoline.
There may be gadget code in firmware which could be exploited with
a Spectre variant 2 attack by a rogue user process. To mitigate such
attacks on x86, the Indirect Branch Restricted Speculation (IBRS)
feature is turned on before the kernel invokes any firmware code.
2. A user process attacking another user process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A malicious user process can try to attack another user process,
either via a context switch on the same hardware thread, or from the
sibling hyperthread sharing a physical processor core on a simultaneous
multi-threading (SMT) system.
Spectre variant 1 attacks generally require passing parameters
between the processes, which needs a data passing relationship, such
as remote procedure calls (RPC). Those parameters are used in gadget
code to derive invalid data pointers accessing privileged memory in
the attacked process.
Spectre variant 2 attacks can be launched from a rogue process by
:ref:`poisoning <poison_btb>` the branch target buffer. This can
influence the indirect branch targets for a victim process that either
runs later on the same hardware thread, or runs concurrently on
a sibling hardware thread sharing the same physical core.
A user process can protect itself against Spectre variant 2 attacks
by using the prctl() syscall to disable indirect branch speculation
for itself. An administrator can also cordon off an unsafe process
from polluting the branch target buffer by disabling the process's
indirect branch speculation. This comes with a performance cost
from not using indirect branch speculation and clearing the branch
target buffer. When SMT is enabled on x86, for a process that has
indirect branch speculation disabled, Single Threaded Indirect Branch
Predictors (STIBP) :ref:`[4] <spec_ref4>` are turned on to prevent the
sibling thread from controlling the branch target buffer. In addition,
the Indirect Branch Prediction Barrier (IBPB) is issued to clear the
branch target buffer when context switching to and from such a process.
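A minimal sketch of that prctl() interface follows (see
Documentation/userspace-api/spec_ctrl.rst for the authoritative
description; error handling here is deliberately minimal)::

    #include <sys/prctl.h>
    #include <linux/prctl.h>
    #include <stdio.h>

    int main(void)
    {
        /* Disable indirect branch speculation for this task;
         * the mitigation control state is inherited on fork().
         */
        if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                  PR_SPEC_DISABLE, 0, 0))
            perror("PR_SET_SPECULATION_CTRL");

        /* ... run the sensitive or untrusted work here ... */
        return 0;
    }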
On x86, the return stack buffer is stuffed on context switch.
This prevents the branch target buffer from being used for branch
prediction when the return stack buffer underflows while switching to
a deeper call stack. Any poisoned entries in the return stack buffer
left by the previous process will also be cleared.
User programs should use address space randomization to make attacks
more difficult (Set /proc/sys/kernel/randomize_va_space = 1 or 2).
3. A virtualized guest attacking the host
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The attack mechanism is similar to how user processes attack the
kernel. The kernel is entered via hyper-calls or other virtualization
exit paths.
For Spectre variant 1 attacks, rogue guests can pass parameters
(e.g. in registers) via hyper-calls to derive invalid pointers to
speculate into privileged memory after entering the kernel. For places
where such kernel code has been identified, nospec accessor macros
are used to stop speculative memory access.
For Spectre variant 2 attacks, rogue guests can :ref:`poison
<poison_btb>` the branch target buffer or return stack buffer, causing
the kernel to jump to gadget code in the speculative execution paths.
To mitigate variant 2, the host kernel can use return trampolines
for indirect branches to bypass the poisoned branch target buffer,
and flush the return stack buffer on VM exit. This prevents rogue
guests from affecting indirect branching in the host kernel.
To protect host processes from rogue guests, host processes can have
indirect branch speculation disabled via prctl(). The branch target
buffer is cleared before context switching to such processes.
4. A virtualized guest attacking another guest
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A rogue guest may attack another guest to get data accessible by the
other guest.
Spectre variant 1 attacks are possible if parameters can be passed
between guests. This may be done via mechanisms such as shared memory
or message passing. Such parameters could be used to derive data
pointers to privileged data in the guest. The privileged data could be
accessed by gadget code in the victim's speculation paths.
Spectre variant 2 attacks can be launched from a rogue guest by
:ref:`poisoning <poison_btb>` the branch target buffer or the return
stack buffer. Such poisoned entries could be used to influence
speculative execution paths in the victim guest.
The Linux kernel mitigates attacks against other guests running on the
same CPU hardware thread by flushing the return stack buffer on VM exit,
and clearing the branch target buffer before switching to a new guest.
If SMT is used, Spectre variant 2 attacks from an untrusted guest
in the sibling hyperthread can be mitigated by the administrator,
by turning off the unsafe guest's indirect branch speculation via
prctl(). A guest can also protect itself by turning on microcode
based mitigations (such as IBPB or STIBP on x86) within the guest.
.. _spectre_sys_info:
Spectre system information
--------------------------
The Linux kernel provides a sysfs interface to enumerate the current
mitigation status of the system for Spectre: whether the system is
vulnerable, and which mitigations are active.
The sysfs file showing Spectre variant 1 mitigation status is:
/sys/devices/system/cpu/vulnerabilities/spectre_v1
The possible values in this file are:
========================================= =================================
'Mitigation: __user pointer sanitization' Protection in kernel on a case by
                                          case basis with explicit pointer
                                          sanitization.
========================================= =================================
However, the protections are put in place on a case by case basis,
and there is no guarantee that all possible attack vectors for Spectre
variant 1 are covered.
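For example (output varies with the system and kernel version)::

    $ cat /sys/devices/system/cpu/vulnerabilities/spectre_v1
    Mitigation: __user pointer sanitization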
The spectre_v2 kernel file reports if the kernel has been compiled with
retpoline mitigation or if the CPU has hardware mitigation, and if the
CPU has support for additional process-specific mitigation.
This file also reports CPU features enabled by microcode to mitigate
attack between user processes:
1. Indirect Branch Prediction Barrier (IBPB) to add additional
isolation between processes of different users.
2. Single Thread Indirect Branch Predictors (STIBP) to add additional
isolation between CPU threads running on the same core.
These CPU features may impact performance when used and can be enabled
per process on a case-by-case basis.
The sysfs file showing Spectre variant 2 mitigation status is:
/sys/devices/system/cpu/vulnerabilities/spectre_v2
The possible values in this file are:
- Kernel status:
==================================== =================================
'Not affected' The processor is not vulnerable
'Vulnerable' Vulnerable, no mitigation
'Mitigation: Full generic retpoline' Software-focused mitigation
'Mitigation: Full AMD retpoline' AMD-specific software mitigation
'Mitigation: Enhanced IBRS' Hardware-focused mitigation
==================================== =================================
- Firmware status: Show if Indirect Branch Restricted Speculation (IBRS) is
used to protect against Spectre variant 2 attacks when calling firmware (x86 only).
========== =============================================================
'IBRS_FW' Protection against user program attacks when calling firmware
========== =============================================================
- Indirect branch prediction barrier (IBPB) status for protection between
processes of different users. This feature can be controlled through
prctl() per process, or through kernel command line options. This is
an x86 only feature. For more details see below.
=================== ========================================================
'IBPB: disabled' IBPB unused
'IBPB: always-on' Use IBPB on all tasks
'IBPB: conditional' Use IBPB on SECCOMP or indirect branch restricted tasks
=================== ========================================================
- Single threaded indirect branch prediction (STIBP) status for protection
between different hyper threads. This feature can be controlled through
prctl() per process, or through kernel command line options. This is an
x86 only feature. For more details see below.
==================== ========================================================
'STIBP: disabled' STIBP unused
'STIBP: forced' Use STIBP on all tasks
'STIBP: conditional' Use STIBP on SECCOMP or indirect branch restricted tasks
==================== ========================================================
- Return stack buffer (RSB) protection status:
============= ===========================================
'RSB filling' Protection of RSB on context switch enabled
============= ===========================================
Full mitigation might require a microcode update from the CPU
vendor. When the necessary microcode is not available, the kernel will
report the system as vulnerable.
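As an illustration, a retpoline-mitigated system might report (the
exact combination of fields depends on the CPU, microcode, and kernel
configuration)::

    $ cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
    Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling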
Turning on mitigation for Spectre variant 1 and Spectre variant 2
-----------------------------------------------------------------
1. Kernel mitigation
^^^^^^^^^^^^^^^^^^^^
For Spectre variant 1, vulnerable kernel code (as determined
by code audit or scanning tools) is annotated on a case by case
basis to use nospec accessor macros for bounds clipping :ref:`[2]
<spec_ref2>` to avoid any usable disclosure gadgets. However, it may
not cover all attack vectors for Spectre variant 1.
For Spectre variant 2 mitigation, the compiler turns indirect calls or
jumps in the kernel into equivalent return trampolines (retpolines)
:ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` to go to the target
addresses. Speculative execution paths under retpolines are trapped
in an infinite loop to prevent any speculative execution jumping to
a gadget.
To turn on retpoline mitigation on a vulnerable CPU, the kernel
needs to be compiled with a gcc compiler that supports the
-mindirect-branch=thunk-extern -mindirect-branch-register options.
If the kernel is compiled with a Clang compiler, the compiler needs
to support -mretpoline-external-thunk option. The kernel config
CONFIG_RETPOLINE needs to be turned on, and the CPU needs to run with
the latest updated microcode.
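One quick, hedged way to check whether the installed gcc supports the
retpoline options is to compile a trivial translation unit with them::

    $ echo 'int main(void) { return 0; }' | \
      gcc -x c - -mindirect-branch=thunk-extern -mindirect-branch-register \
          -c -o /dev/null && echo "retpoline flags supported"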
On Intel Skylake-era systems the mitigation covers most, but not all,
cases. See :ref:`[3] <spec_ref3>` for more details.
On CPUs with hardware mitigation for Spectre variant 2 (e.g. Enhanced
IBRS on x86), retpoline is automatically disabled at run time.
The retpoline mitigation is turned on by default on vulnerable
CPUs. It can be forced on or off by the administrator
via the kernel command line and sysfs control files. See
:ref:`spectre_mitigation_control_command_line`.
On x86, indirect branch restricted speculation is turned on by default
before invoking any firmware code to prevent Spectre variant 2 exploits
using the firmware.
Using kernel address space randomization (CONFIG_RANDOMIZE_BASE=y)
and slab freelist randomization (CONFIG_SLAB_FREELIST_RANDOM=y) in the
kernel configuration makes attacks on the kernel generally more difficult.
2. User program mitigation
^^^^^^^^^^^^^^^^^^^^^^^^^^
User programs can mitigate Spectre variant 1 using LFENCE or "bounds
clipping". For more details see :ref:`[2] <spec_ref2>`.
For Spectre variant 2 mitigation, individual user programs
can be compiled with return trampolines for indirect branches.
This protects them from consuming poisoned entries in the branch
target buffer left by malicious software. Alternatively, the
programs can disable their indirect branch speculation via prctl()
(See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
On x86, this will turn on STIBP to guard against attacks from the
sibling thread when the user program is running, and use IBPB to
flush the branch target buffer when switching to/from the program.
Restricting indirect branch speculation on a user program will
also prevent the program from launching a variant 2 attack
on x86. All sandboxed SECCOMP programs have indirect branch
speculation restricted by default. Administrators can change
that behavior via the kernel command line and sysfs control files.
See :ref:`spectre_mitigation_control_command_line`.
Programs that disable their indirect branch speculation will have
more overhead and run slower.
User programs should use address space randomization
(/proc/sys/kernel/randomize_va_space = 1 or 2) to make attacks more
difficult.
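For example, as root::

    # echo 2 > /proc/sys/kernel/randomize_va_space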
3. VM mitigation
^^^^^^^^^^^^^^^^
Within the kernel, Spectre variant 1 attacks from rogue guests are
mitigated on a case by case basis in VM exit paths. Vulnerable code
uses nospec accessor macros for "bounds clipping", to avoid any
usable disclosure gadgets. However, this may not cover all variant
1 attack vectors.
For Spectre variant 2 attacks from rogue guests against the kernel, the
Linux kernel uses retpoline or Enhanced IBRS to prevent consumption of
poisoned entries in the branch target buffer left by rogue guests. It
also flushes the return stack buffer on every VM exit to prevent return
stack buffer underflow (which would allow the poisoned branch target
buffer to be used) and to clear any poisoned entries rogue guests may
have left in the return stack buffer.
To mitigate guest-to-guest attacks in the same CPU hardware thread,
the branch target buffer is sanitized by flushing before switching
to a new guest on a CPU.
The above mitigations are turned on by default on vulnerable CPUs.
To mitigate guest-to-guest attacks from sibling thread when SMT is
in use, an untrusted guest running in the sibling thread can have
its indirect branch speculation disabled by the administrator via prctl().
The kernel also allows guests to use any microcode based mitigation
they choose (such as IBPB or STIBP on x86) to protect themselves.
.. _spectre_mitigation_control_command_line:
Mitigation control on the kernel command line
---------------------------------------------
Spectre variant 2 mitigation can be disabled or force enabled at the
kernel command line.
nospectre_v2
[X86] Disable all mitigations for the Spectre variant 2
(indirect branch prediction) vulnerability. System may
allow data leaks with this option, which is equivalent
to spectre_v2=off.
spectre_v2=
[X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.
The default operation protects the kernel from
user space attacks.
on
unconditionally enable, implies
spectre_v2_user=on
off
unconditionally disable, implies
spectre_v2_user=off
auto
kernel detects whether your CPU model is
vulnerable
Selecting 'on' will, and 'auto' may, choose a
mitigation method at run time according to the
CPU, the available microcode, the setting of the
CONFIG_RETPOLINE configuration option, and the
compiler with which the kernel was built.
Selecting 'on' will also enable the mitigation
against user space to user space task attacks.
Selecting 'off' will disable both the kernel and
the user space protections.
Specific mitigations can also be selected manually:
retpoline
replace indirect branches
retpoline,generic
Google's original retpoline
retpoline,amd
AMD-specific minimal thunk
Not specifying this option is equivalent to
spectre_v2=auto.
For user space mitigation:
spectre_v2_user=
[X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability between
user space tasks
on
Unconditionally enable mitigations. Is
enforced by spectre_v2=on
off
Unconditionally disable mitigations. Is
enforced by spectre_v2=off
prctl
Indirect branch speculation is enabled,
but mitigation can be enabled via prctl
per thread. The mitigation control state
is inherited on fork.
prctl,ibpb
Like "prctl" above, but only STIBP is
controlled per thread. IBPB is issued
always when switching between different user
space processes.
seccomp
Same as "prctl" above, but all seccomp
threads will enable the mitigation unless
they explicitly opt out.
seccomp,ibpb
Like "seccomp" above, but only STIBP is
controlled per thread. IBPB is issued
always when switching between different
user space processes.
auto
Kernel selects the mitigation depending on
the available CPU features and vulnerability.
Default mitigation:
If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl"
Not specifying this option is equivalent to
spectre_v2_user=auto.
In general the kernel by default selects
reasonable mitigations for the current CPU. To
disable Spectre variant 2 mitigations, boot with
spectre_v2=off. Spectre variant 1 mitigations
cannot be disabled.
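After boot, the active selection can be double-checked from a shell
(the values shown are illustrative)::

    $ cat /proc/cmdline
    BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro spectre_v2=off
    $ cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
    Vulnerable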
Mitigation selection guide
--------------------------
1. Trusted userspace
^^^^^^^^^^^^^^^^^^^^
If all userspace applications are from trusted sources and do not
execute externally supplied untrusted code, then the mitigations can
be disabled.
2. Protect sensitive programs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For security-sensitive programs that have secrets (e.g. crypto
keys), protection against Spectre variant 2 can be put in place by
disabling indirect branch speculation when the program is running
(See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
3. Sandbox untrusted programs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Untrusted programs that could be a source of attacks can be cordoned
off by disabling their indirect branch speculation when they are run
(See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
This prevents untrusted programs from polluting the branch target
buffer. All programs running in SECCOMP sandboxes have indirect
branch speculation restricted by default. This behavior can be
changed via the kernel command line and sysfs control files. See
:ref:`spectre_mitigation_control_command_line`.
4. High security mode
^^^^^^^^^^^^^^^^^^^^^
All Spectre variant 2 mitigations can be forced on
at boot time for all programs (See the "on" option in
:ref:`spectre_mitigation_control_command_line`). This will add
overhead as indirect branch speculations for all programs will be
restricted.
On x86, the branch target buffer will be flushed with IBPB when switching
to a new program. STIBP is left on all the time to protect programs
against variant 2 attacks originating from programs running on
sibling threads.
Alternatively, STIBP can be used only when running programs
whose indirect branch speculation is explicitly disabled,
while IBPB is still used all the time when switching to a new
program to clear the branch target buffer (See "ibpb" option in
:ref:`spectre_mitigation_control_command_line`). This "ibpb" option
has less performance cost than the "on" option, which leaves STIBP
on all the time.
References on Spectre
---------------------
Intel white papers:
.. _spec_ref1:
[1] `Intel analysis of speculative execution side channels <https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf>`_.
.. _spec_ref2:
[2] `Bounds check bypass <https://software.intel.com/security-software-guidance/software-guidance/bounds-check-bypass>`_.
.. _spec_ref3:
[3] `Deep dive: Retpoline: A branch target injection mitigation <https://software.intel.com/security-software-guidance/insights/deep-dive-retpoline-branch-target-injection-mitigation>`_.
.. _spec_ref4:
[4] `Deep Dive: Single Thread Indirect Branch Predictors <https://software.intel.com/security-software-guidance/insights/deep-dive-single-thread-indirect-branch-predictors>`_.
AMD white papers:
.. _spec_ref5:
[5] `AMD64 technology indirect branch control extension <https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf>`_.
.. _spec_ref6:
[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/90343-B_SoftwareTechniquesforManagingSpeculation_WP_7-18Update_FNL.pdf>`_.
ARM white papers:
.. _spec_ref7:
[7] `Cache speculation side-channels <https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/download-the-whitepaper>`_.
.. _spec_ref8:
[8] `Cache speculation issues update <https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/latest-updates/cache-speculation-issues-update>`_.
Google white paper:
.. _spec_ref9:
[9] `Retpoline: a software construct for preventing branch-target-injection <https://support.google.com/faqs/answer/7625886>`_.
MIPS white paper:
.. _spec_ref10:
[10] `MIPS: response on speculative execution and side channel vulnerabilities <https://www.mips.com/blog/mips-response-on-speculative-execution-and-side-channel-vulnerabilities/>`_.
Academic papers:
.. _spec_ref11:
[11] `Spectre Attacks: Exploiting Speculative Execution <https://spectreattack.com/spectre.pdf>`_.
.. _spec_ref12:
[12] `NetSpectre: Read Arbitrary Memory over Network <https://arxiv.org/abs/1807.10535>`_.
.. _spec_ref13:
[13] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://www.usenix.org/system/files/conference/woot18/woot18-paper-koruyeh.pdf>`_.
View File
@@ -0,0 +1,105 @@
==========================================================
Linux support for random number generator in i8xx chipsets
==========================================================
Introduction
============
The hw_random framework is software that makes use of a
special hardware feature on your CPU or motherboard,
a Random Number Generator (RNG). The software has two parts:
a core providing the /dev/hwrng character device and its
sysfs support, plus a hardware-specific driver that plugs
into that core.
To make the most effective use of these mechanisms, you
should download the support software as well. Download the
latest version of the "rng-tools" package from the
hw_random driver's official Web site:
http://sourceforge.net/projects/gkernel/
Those tools use /dev/hwrng to fill the kernel entropy pool,
which is used internally and exported by the /dev/urandom and
/dev/random special files.
Theory of operation
===================
CHARACTER DEVICE. Using the standard open()
and read() system calls, you can read random data from
the hardware RNG device. This data is NOT CHECKED by any
fitness tests, and could potentially be bogus (if the
hardware is faulty or has been tampered with). Data is only
output if the hardware "has-data" flag is set, but nevertheless
a security-conscious person would run fitness tests on the
data before assuming it is truly random.
The rng-tools package uses such tests in "rngd", and lets you
run them by hand with a "rngtest" utility.
/dev/hwrng is char device major 10, minor 183.
CLASS DEVICE. There is a /sys/class/misc/hw_random node with
two unique attributes, "rng_available" and "rng_current". The
"rng_available" attribute lists the hardware-specific drivers
available, while "rng_current" lists the one which is currently
connected to /dev/hwrng. If your system has more than one
RNG available, you may change the one used by writing a name from
the list in "rng_available" into "rng_current".
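For example (driver names are illustrative and system-dependent)::

    # cat /sys/class/misc/hw_random/rng_available
    tpm-rng intel
    # echo intel > /sys/class/misc/hw_random/rng_current
    # cat /sys/class/misc/hw_random/rng_current
    intel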
==========================================================================
Hardware driver for Intel/AMD/VIA Random Number Generators (RNG)
- Copyright 2000,2001 Jeff Garzik <jgarzik@pobox.com>
- Copyright 2000,2001 Philipp Rumpf <prumpf@mandrakesoft.com>
About the Intel RNG hardware, from the firmware hub datasheet
=============================================================
The Firmware Hub integrates a Random Number Generator (RNG)
using thermal noise generated from inherently random quantum
mechanical properties of silicon. When not generating new random
bits the RNG circuitry will enter a low power state. Intel will
provide a binary software driver to give third party software
access to our RNG for use as a security feature. At this time,
the RNG is only to be used with a system in an OS-present state.
Intel RNG Driver notes
======================
FIXME: support poll(2)
.. note::
request_mem_region was removed, for three reasons:
1) Only one RNG is supported by this driver;
2) The location used by the RNG is a fixed location in
MMIO-addressable memory;
3) users with properly working BIOS e820 handling will always
have the region in which the RNG is located reserved, so
request_mem_region calls always fail for proper setups.
However, for people who use mem=XX, BIOS e820 information is
**not** in /proc/iomem, and request_mem_region(RNG_ADDR) can
succeed.
Driver details
==============
Based on:
Intel 82802AB/82802AC Firmware Hub (FWH) Datasheet
May 1999 Order Number: 290658-002 R
Intel 82802 Firmware Hub:
Random Number Generator
Programmer's Reference Manual
December 1999 Order Number: 298029-001 R
Intel 82802 Firmware HUB Random Number Generator Driver
Copyright (c) 2000 Matt Sottek <msottek@quiknet.com>
Special thanks to Matt Sottek. I did the "guts", he
did the "brains" and all the testing.
View File
@@ -16,6 +16,7 @@ etc.
README
kernel-parameters
devices
sysctl/index
This section describes CPU vulnerabilities and their mitigations.
@@ -38,6 +39,8 @@ problems and bugs in particular.
ramoops
dynamic-debug-howto
init
kdump/index
perf/index
This is the beginning of a section with information of interest to
application developers. Documents covering various aspects of the kernel
@@ -56,11 +59,13 @@ configure specific aspects of kernel behavior to your liking.
initrd
cgroup-v2
cgroup-v1/index
serial-console
braille-console
parport
md
module-signing
rapidio
sysrq
unicode
vga-softcursor
@@ -69,13 +74,38 @@ configure specific aspects of kernel behavior to your liking.
java
ras
bcache
blockdev/index
ext4
binderfs
xfs
pm/index
thunderbolt
LSM/index
mm/index
namespaces/index
perf-security
acpi/index
aoe/index
btmrvl
clearing-warn-once
cpu-load
cputopology
device-mapper/index
efi-stub
gpio/index
highuid
hw_random
iostats
kernel-per-CPU-kthreads
laptops/index
lcd-panel-cgram
ldm
lockup-watchdogs
numastat
pnp
rtc
svga
video-output
.. only:: subproject and html
View File
@@ -0,0 +1,197 @@
=====================
I/O statistics fields
=====================
Since 2.4.20 (and some versions before, with patches), and 2.5.45,
more extensive disk statistics have been introduced to help measure disk
activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do
the work for you, but in case you are interested in creating your own
tools, the fields are explained here.
In 2.4, the information is found as additional fields in
``/proc/partitions``. In 2.6 and later, the same information is found in two
places: one is in the file ``/proc/diskstats``, and the other is within
the sysfs file system, which must be mounted in order to obtain
the information. Throughout this document we'll assume that sysfs
is mounted on ``/sys``, although of course it may be mounted anywhere.
Both ``/proc/diskstats`` and sysfs use the same source for the information
and so should not differ.
Here are examples of these different formats::
2.4:
3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030
2.6+ sysfs:
446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
35486 38030 38030 38030
2.6+ diskstats:
3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
3 1 hda1 35486 38030 38030 38030
4.18+ diskstats:
3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0
On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.
The advantage of one over the other is that the sysfs choice works well
if you are watching a known, small set of disks. ``/proc/diskstats`` may
be a better choice if you are watching a large number of disks because
you'll avoid the overhead of 50, 100, or 500 or more opens/closes with
each snapshot of your disk statistics.
In 2.4, the statistics fields are those after the device name. In
the above example, the first field of statistics would be 446216.
By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
find just the eleven fields, beginning with 446216. If you look at
``/proc/diskstats``, the eleven fields will be preceded by the major and
minor device numbers, and device name. Each of these formats provides
eleven fields of statistics, each meaning exactly the same things.
All fields except field 9 are cumulative since boot. Field 9 should
go to zero as I/Os complete; all others only increase (unless they
overflow and wrap). Yes, these are (32-bit or 64-bit) unsigned long
(native word size) numbers, and on a very busy or long-lived system they
may wrap. Applications should be prepared to deal with that; unless
your observations are measured in large numbers of minutes or hours,
they should not wrap twice before you notice them.
Each set of stats only applies to the indicated device; if you want
system-wide stats you'll have to find all the devices and sum them all up.
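For example, a quick way to extract a single field for one disk from
``/proc/diskstats`` (the device name is illustrative; field 1, reads
completed, is the fourth whitespace-separated column there)::

    $ awk '$3 == "sda" { print $4 }' /proc/diskstats
    446216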
Field 1 -- # of reads completed
This is the total number of reads completed successfully.
Field 2 -- # of reads merged, field 6 -- # of writes merged
Reads and writes which are adjacent to each other may be merged for
efficiency. Thus two 4K reads may become one 8K read before it is
ultimately handed to the disk, and so it will be counted (and queued)
as only one I/O. This field lets you know how often this was done.
Field 3 -- # of sectors read
This is the total number of sectors read successfully.
Field 4 -- # of milliseconds spent reading
This is the total number of milliseconds spent by all reads (as
measured from __make_request() to end_that_request_last()).
Field 5 -- # of writes completed
This is the total number of writes completed successfully.
Field 6 -- # of writes merged
See the description of field 2.
Field 7 -- # of sectors written
This is the total number of sectors written successfully.
Field 8 -- # of milliseconds spent writing
This is the total number of milliseconds spent by all writes (as
measured from __make_request() to end_that_request_last()).
Field 9 -- # of I/Os currently in progress
The only field that should go to zero. Incremented as requests are
given to appropriate struct request_queue and decremented as they finish.
Field 10 -- # of milliseconds spent doing I/Os
This field increases so long as field 9 is nonzero.
Since 5.0 this field counts jiffies when at least one request was
started or completed. If a request runs for more than 2 jiffies then some
I/O time will not be accounted unless there are other requests.
Field 11 -- weighted # of milliseconds spent doing I/Os
This field is incremented at each I/O start, I/O completion, I/O
merge, or read of these stats by the number of I/Os in progress
(field 9) times the number of milliseconds spent doing I/O since the
last update of this field. This can provide an easy measure of both
I/O completion time and the backlog that may be accumulating.
Field 12 -- # of discards completed
This is the total number of discards completed successfully.
Field 13 -- # of discards merged
See the description of field 2.
Field 14 -- # of sectors discarded
This is the total number of sectors discarded successfully.
Field 15 -- # of milliseconds spent discarding
This is the total number of milliseconds spent by all discards (as
measured from __make_request() to end_that_request_last()).
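As a rough, hedged illustration, fields 1 and 4 together yield an
average read completion time (this assumes at least one completed read;
the device name is illustrative)::

    $ awk '{ printf "avg read time: %.2f ms\n", $4 / $1 }' /sys/block/sda/stat
    avg read time: 9.82 ms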
To avoid introducing performance bottlenecks, no locks are held while
modifying these counters. This implies that minor inaccuracies may be
introduced when changes collide, so (for instance) adding up all the
read I/Os issued per partition should equal those made to the disks ...
but due to the lack of locking it may only be very close.
In 2.6+, there are counters for each CPU, which make the lack of locking
almost a non-issue. When the statistics are read, the per-CPU counters
are summed (possibly overflowing the unsigned long variable they are
summed to) and the result given to the user. There is no convenient
user interface for accessing the per-CPU counters themselves.
Disks vs Partitions
-------------------
There were significant changes between 2.4 and 2.6+ in the I/O subsystem.
As a result, some statistic information disappeared. The translation from
a disk address relative to a partition to the disk address relative to
the host disk happens much earlier. All merges and timings now happen
at the disk level rather than at both the disk and partition level as
in 2.4. Consequently, you'll see a different statistics output on 2.6+ for
partitions from that for disks. There are only *four* fields available
for partitions on 2.6+ machines. This is reflected in the examples above.
Field 1 -- # of reads issued
This is the total number of reads issued to this partition.
Field 2 -- # of sectors read
This is the total number of sectors requested to be read from this
partition.
Field 3 -- # of writes issued
This is the total number of writes issued to this partition.
Field 4 -- # of sectors written
This is the total number of sectors requested to be written to
this partition.
Note that since the address is translated to a disk-relative one, and no
record of the partition-relative address is kept, the subsequent success
or failure of the read cannot be attributed to the partition. In other
words, the number of reads for partitions is counted slightly before time
of queuing for partitions, and at completion for whole disks. This is
a subtle distinction that is probably uninteresting for most cases.
More significant is the error induced by counting the numbers of
reads/writes before merges for partitions and after for disks. Since a
typical workload usually contains a lot of successive and adjacent requests,
the number of reads/writes issued can be several times higher than the
number of reads/writes completed.
In 2.6.25, the full statistics set is again available for partitions, and
disk and partition statistics are consistent again. Since we still don't
keep record of the partition-relative address, an operation is attributed to
the partition which contains the first sector of the request after the
eventual merges. As requests can be merged across partitions, this could lead
to some (probably insignificant) inaccuracy.
Additional notes
----------------
In 2.6+, sysfs is not mounted by default. If your distribution of
Linux hasn't added it already, here's the line you'll want to add to
your ``/etc/fstab``::
none /sys sysfs defaults 0 0
In 2.6+, all disk statistics were removed from ``/proc/stat``. In 2.4, they
appear in both ``/proc/partitions`` and ``/proc/stat``, although the ones in
``/proc/stat`` take a very different format from those in ``/proc/partitions``
(see proc(5), if your system has it.)
-- ricklind@us.ibm.com
View File
@@ -0,0 +1,264 @@
#
# This file contains a few gdb macros (user defined commands) to extract
# useful information from kernel crashdump (kdump) like stack traces of
# all the processes or a particular process and trapinfo.
#
# These macros can be used by copying this file in .gdbinit (put in home
# directory or current directory) or by invoking gdb command with
# --command=<command-file-name> option
#
# Credits:
# Alexander Nyberg <alexn@telia.com>
# V Srivatsa <vatsa@in.ibm.com>
# Maneesh Soni <maneesh@in.ibm.com>
#
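#
# Example invocation against a crash dump (paths are illustrative):
#
#   gdb --command=gdbmacros.txt vmlinux /proc/vmcore
#   (gdb) btt            # all thread stacks (CONFIG_FRAME_POINTER kernels)
#   (gdb) btpid 1234     # stack trace of pid 1234
#   (gdb) trapinfo 1234  # trap information for pid 1234
#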
define bttnobp
set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
set $init_t=&init_task
set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
set var $stacksize = sizeof(union thread_union)
while ($next_t != $init_t)
set $next_t=(struct task_struct *)$next_t
printf "\npid %d; comm %s:\n", $next_t.pid, $next_t.comm
printf "===================\n"
set var $stackp = $next_t.thread.sp
set var $stack_top = ($stackp & ~($stacksize - 1)) + $stacksize
while ($stackp < $stack_top)
if (*($stackp) > _stext && *($stackp) < _sinittext)
info symbol *($stackp)
end
set $stackp += 4
end
set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
while ($next_th != $next_t)
set $next_th=(struct task_struct *)$next_th
printf "\npid %d; comm %s:\n", $next_t.pid, $next_t.comm
printf "===================\n"
set var $stackp = $next_t.thread.sp
set var $stack_top = ($stackp & ~($stacksize - 1)) + stacksize
while ($stackp < $stack_top)
if (*($stackp) > _stext && *($stackp) < _sinittext)
info symbol *($stackp)
end
set $stackp += 4
end
set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
end
set $next_t=(char *)($next_t->tasks.next) - $tasks_off
end
end
document bttnobp
dump all thread stack traces on a kernel compiled with !CONFIG_FRAME_POINTER
end
define btthreadstack
set var $pid_task = $arg0
printf "\npid %d; comm %s:\n", $pid_task.pid, $pid_task.comm
printf "task struct: "
print $pid_task
printf "===================\n"
set var $stackp = $pid_task.thread.sp
set var $stacksize = sizeof(union thread_union)
set var $stack_top = ($stackp & ~($stacksize - 1)) + $stacksize
set var $stack_bot = ($stackp & ~($stacksize - 1))
set $stackp = *((unsigned long *) $stackp)
while (($stackp < $stack_top) && ($stackp > $stack_bot))
set var $addr = *(((unsigned long *) $stackp) + 1)
info symbol $addr
set $stackp = *((unsigned long *) $stackp)
end
end
document btthreadstack
dump a thread stack using the given task structure pointer
end
define btt
set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
set $init_t=&init_task
set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
while ($next_t != $init_t)
set $next_t=(struct task_struct *)$next_t
btthreadstack $next_t
set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
while ($next_th != $next_t)
set $next_th=(struct task_struct *)$next_th
btthreadstack $next_th
set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
end
set $next_t=(char *)($next_t->tasks.next) - $tasks_off
end
end
document btt
dump all thread stack traces on a kernel compiled with CONFIG_FRAME_POINTER
end
define btpid
set var $pid = $arg0
set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
set $init_t=&init_task
set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
set var $pid_task = 0
while ($next_t != $init_t)
set $next_t=(struct task_struct *)$next_t
if ($next_t.pid == $pid)
set $pid_task = $next_t
end
set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
while ($next_th != $next_t)
set $next_th=(struct task_struct *)$next_th
if ($next_th.pid == $pid)
set $pid_task = $next_th
end
set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
end
set $next_t=(char *)($next_t->tasks.next) - $tasks_off
end
btthreadstack $pid_task
end
document btpid
backtrace of pid
end
define trapinfo
set var $pid = $arg0
set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
set $init_t=&init_task
set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
set var $pid_task = 0
while ($next_t != $init_t)
set $next_t=(struct task_struct *)$next_t
if ($next_t.pid == $pid)
set $pid_task = $next_t
end
set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
while ($next_th != $next_t)
set $next_th=(struct task_struct *)$next_th
if ($next_th.pid == $pid)
set $pid_task = $next_th
end
set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
end
set $next_t=(char *)($next_t->tasks.next) - $tasks_off
end
printf "Trapno %ld, cr2 0x%lx, error_code %ld\n", $pid_task.thread.trap_no, \
$pid_task.thread.cr2, $pid_task.thread.error_code
end
document trapinfo
Run info threads and lookup pid of thread #1
'trapinfo <pid>' will tell you by which trap & possibly
address the kernel panicked.
end
define dump_log_idx
set $idx = $arg0
if ($argc > 1)
set $prev_flags = $arg1
else
set $prev_flags = 0
end
set $msg = ((struct printk_log *) (log_buf + $idx))
set $prefix = 1
set $newline = 1
set $log = log_buf + $idx + sizeof(*$msg)
# prev & LOG_CONT && !(msg->flags & LOG_PREFIX)
if (($prev_flags & 8) && !($msg->flags & 4))
set $prefix = 0
end
# msg->flags & LOG_CONT
if ($msg->flags & 8)
# (prev & LOG_CONT && !(prev & LOG_NEWLINE))
if (($prev_flags & 8) && !($prev_flags & 2))
set $prefix = 0
end
# (!(msg->flags & LOG_NEWLINE))
if (!($msg->flags & 2))
set $newline = 0
end
end
if ($prefix)
printf "[%5lu.%06lu] ", $msg->ts_nsec / 1000000000, $msg->ts_nsec % 1000000000
end
if ($msg->text_len != 0)
eval "printf \"%%%d.%ds\", $log", $msg->text_len, $msg->text_len
end
if ($newline)
printf "\n"
end
if ($msg->dict_len > 0)
set $dict = $log + $msg->text_len
set $idx = 0
set $line = 1
while ($idx < $msg->dict_len)
if ($line)
printf " "
set $line = 0
end
set $c = $dict[$idx]
if ($c == '\0')
printf "\n"
set $line = 1
else
if ($c < ' ' || $c >= 127 || $c == '\\')
printf "\\x%02x", $c
else
printf "%c", $c
end
end
set $idx = $idx + 1
end
printf "\n"
end
end
document dump_log_idx
Dump a single log given its index in the log buffer. The first
parameter is the index into log_buf, the second is optional and
specifies the previous log entry's flags, used for properly
formatting continued lines.
end
define dmesg
set $i = log_first_idx
set $end_idx = log_first_idx
set $prev_flags = 0
while (1)
set $msg = ((struct printk_log *) (log_buf + $i))
if ($msg->len == 0)
set $i = 0
else
dump_log_idx $i $prev_flags
set $i = $i + $msg->len
set $prev_flags = $msg->flags
end
if ($i == $end_idx)
loop_break
end
end
end
document dmesg
print the kernel ring buffer
end
View File
@@ -0,0 +1,20 @@
================================================================
Documentation for Kdump - The kexec-based Crash Dumping Solution
================================================================
This document includes overview, setup and installation, and analysis
information.
.. toctree::
:maxdepth: 1
kdump
vmcoreinfo
.. only:: subproject and html
Indices
=======
* :ref:`genindex`
View File
@@ -0,0 +1,534 @@
================================================================
Documentation for Kdump - The kexec-based Crash Dumping Solution
================================================================
This document includes overview, setup and installation, and analysis
information.
Overview
========
Kdump uses kexec to quickly boot to a dump-capture kernel whenever a
dump of the system kernel's memory needs to be taken (for example, when
the system panics). The system kernel's memory image is preserved across
the reboot and is accessible to the dump-capture kernel.
You can use common commands, such as cp and scp, to copy the
memory image to a dump file on the local disk, or across the network to
a remote system.
Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
s390x, arm and arm64 architectures.
When the system kernel boots, it reserves a small section of memory for
the dump-capture kernel. This ensures that ongoing Direct Memory Access
(DMA) from the system kernel does not corrupt the dump-capture kernel.
The kexec -p command loads the dump-capture kernel into this reserved
memory.
On x86 machines, the first 640 KB of physical memory is needed to boot,
regardless of where the kernel loads. Therefore, kexec backs up this
region just before rebooting into the dump-capture kernel.
Similarly, on PPC64 machines, the first 32KB of physical memory is
needed for booting regardless of where the kernel is loaded; to support
a 64K page size, kexec backs up the first 64KB of memory.
For s390x, when kdump is triggered, the crashkernel region is exchanged
with the region [0, crashkernel region size] and then the kdump kernel
runs in [0, crashkernel region size]. Therefore no relocatable kernel is
needed for s390x.
All of the necessary information about the system kernel's core image is
encoded in the ELF format, and stored in a reserved area of memory
before a crash. The physical address of the start of the ELF header is
passed to the dump-capture kernel through the elfcorehdr= boot
parameter. Optionally the size of the ELF header can also be passed
when using the elfcorehdr=[size[KMG]@]offset[KMG] syntax.
With the dump-capture kernel, you can access the memory image through
/proc/vmcore. This exports the dump as an ELF-format file that you can
write out using file copy commands such as cp or scp. Further, you can
use analysis tools such as the GNU Debugger (GDB) and the Crash tool to
debug the dump file. This method ensures that the dump pages are correctly
ordered.
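For example, from the booted dump-capture environment (paths and tool
availability vary by distribution)::

    # cp /proc/vmcore /var/crash/vmcore
    # crash vmlinux /var/crash/vmcore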
Setup and Installation
======================
Install kexec-tools
-------------------
1) Login as the root user.
2) Download the kexec-tools user-space package from the following URL:
http://kernel.org/pub/linux/utils/kernel/kexec/kexec-tools.tar.gz
This is a symlink to the latest version.
The latest kexec-tools git tree is available at:
- git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
- http://www.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
There is also a gitweb interface available at
http://www.kernel.org/git/?p=utils/kernel/kexec/kexec-tools.git
More information about kexec-tools can be found at
http://horms.net/projects/kexec/
3) Unpack the tarball with the tar command, as follows::
tar xvpzf kexec-tools.tar.gz
4) Change to the kexec-tools directory, as follows::
cd kexec-tools-VERSION
5) Configure the package, as follows::
./configure
6) Compile the package, as follows::
make
7) Install the package, as follows::
make install
Build the system and dump-capture kernels
-----------------------------------------
There are two possible methods of using Kdump.
1) Build a separate custom dump-capture kernel for capturing the
kernel core dump.
2) Or use the system kernel binary itself as the dump-capture kernel,
in which case there is no need to build a separate dump-capture
kernel. This is possible only on architectures that support a
relocatable kernel. As of today, the i386, x86_64, ppc64, ia64, arm
and arm64 architectures support relocatable kernels.
Building a relocatable kernel is advantageous because one does not
have to build a second kernel for capturing the dump. But at the same
time one might want to build a custom dump-capture kernel suitable to
one's needs.
The following are the configuration settings required for the system
and dump-capture kernels to enable kdump support.
System kernel config options
----------------------------
1) Enable "kexec system call" in "Processor type and features."::
CONFIG_KEXEC=y
2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo
filesystems." This is usually enabled by default::
CONFIG_SYSFS=y
Note that "sysfs file system support" might not appear in the "Pseudo
filesystems" menu if "Configure standard kernel features (for small
systems)" is not enabled in "General Setup." In this case, check the
.config file itself to ensure that sysfs is turned on, as follows::
grep 'CONFIG_SYSFS' .config
3) Enable "Compile the kernel with debug info" in "Kernel hacking."::
CONFIG_DEBUG_INFO=y
This causes the kernel to be built with debug symbols. The dump
analysis tools require a vmlinux with debug symbols in order to read
and analyze a dump file.
Dump-capture kernel config options (Arch Independent)
-----------------------------------------------------
1) Enable "kernel crash dumps" support under "Processor type and
features"::
CONFIG_CRASH_DUMP=y
2) Enable "/proc/vmcore support" under "Filesystems" -> "Pseudo filesystems"::
CONFIG_PROC_VMCORE=y
(CONFIG_PROC_VMCORE is set by default when CONFIG_CRASH_DUMP is selected.)
Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
--------------------------------------------------------------------
1) On i386, enable high memory support under "Processor type and
features"::
CONFIG_HIGHMEM64G=y
or::
CONFIG_HIGHMEM4G=y
2) On i386 and x86_64, disable symmetric multi-processing support
under "Processor type and features"::
CONFIG_SMP=n
(If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
when loading the dump-capture kernel, see section "Load the Dump-capture
Kernel".)
3) If one wants to build and use a relocatable kernel,
Enable "Build a relocatable kernel" support under "Processor type and
features"::
CONFIG_RELOCATABLE=y
4) Use a suitable value for "Physical address where the kernel is
loaded" (under "Processor type and features"). This only appears when
"kernel crash dumps" is enabled. A suitable value depends upon
whether kernel is relocatable or not.
If you are using a relocatable kernel use CONFIG_PHYSICAL_START=0x100000
This will compile the kernel for physical address 1MB, but given the fact
kernel is relocatable, it can be run from any physical address hence
kexec boot loader will load it in memory region reserved for dump-capture
kernel.
Otherwise it should be the start of memory region reserved for
second kernel using boot parameter "crashkernel=Y@X". Here X is
start of memory region reserved for dump-capture kernel.
Generally X is 16MB (0x1000000). So you can set
CONFIG_PHYSICAL_START=0x1000000
5) Make and install the kernel and its modules. DO NOT add this kernel
to the boot loader configuration files.
Dump-capture kernel config options (Arch Dependent, ppc64)
----------------------------------------------------------
1) Enable "Build a kdump crash kernel" support under "Kernel" options::
CONFIG_CRASH_DUMP=y
2) Enable "Build a relocatable kernel" support::
CONFIG_RELOCATABLE=y
Make and install the kernel and its modules.
Dump-capture kernel config options (Arch Dependent, ia64)
----------------------------------------------------------
- No specific options are required to create a dump-capture kernel
for ia64, other than those specified in the arch independent section
above. This means that it is possible to use the system kernel
as a dump-capture kernel if desired.
The crashkernel region can be automatically placed by the system
kernel at run time. This is done by specifying the base address as 0,
or omitting it altogether::
crashkernel=256M@0
or::
crashkernel=256M
If the start address is specified, note that the start address of the
kernel will be aligned to 64MB, so if the start address is not, then
any space below the alignment point will be wasted.
Dump-capture kernel config options (Arch Dependent, arm)
----------------------------------------------------------
- To use a relocatable kernel,
Enable "AUTO_ZRELADDR" support under "Boot" options::
AUTO_ZRELADDR=y
Dump-capture kernel config options (Arch Dependent, arm64)
----------------------------------------------------------
- Please note that kvm of the dump-capture kernel will not be enabled
on non-VHE systems even if it is configured. This is because the CPU
will not be reset to EL2 on panic.
Extended crashkernel syntax
===========================
While the "crashkernel=size[@offset]" syntax is sufficient for most
configurations, sometimes it's handy to have the reserved memory dependent
on the value of System RAM -- that's mostly for distributors that pre-setup
the kernel command line to avoid a unbootable system after some memory has
been removed from the machine.
The syntax is::
crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
range=start-[end]
For example::
crashkernel=512M-2G:64M,2G-:128M
This would mean:
1) if the RAM is smaller than 512M, then don't reserve anything
(this is the "rescue" case)
2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
3) if the RAM size is larger than 2G, then reserve 128M
Boot into System Kernel
=======================
1) Update the boot loader (such as grub, yaboot, or lilo) configuration
files as necessary.
2) Boot the system kernel with the boot parameter "crashkernel=Y@X",
where Y specifies how much memory to reserve for the dump-capture kernel
and X specifies the beginning of this reserved memory. For example,
"crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
starting at physical address 0x01000000 (16MB) for the dump-capture kernel.
On x86 and x86_64, use "crashkernel=64M@16M".
On ppc64, use "crashkernel=128M@32M".
On ia64, 256M@256M is a generous value that typically works.
The region may be automatically placed on ia64, see the
dump-capture kernel config option notes above.
If sparse memory is used, the size should be rounded to GRANULE boundaries.
On s390x, typically use "crashkernel=xxM". The value of xx is dependent
on the memory consumption of the kdump system. In general this is not
dependent on the memory size of the production system.
On arm, the use of "crashkernel=Y@X" is no longer necessary; the
kernel will automatically locate the crash kernel image within the
first 512MB of RAM if X is not given.
On arm64, use "crashkernel=Y[@X]". Note that the start address of
the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).
Load the Dump-capture Kernel
============================
After booting to the system kernel, the dump-capture kernel needs to be
loaded.
Based on the architecture and type of image (relocatable or not), one
can choose to load the uncompressed vmlinux or compressed bzImage/vmlinuz
of the dump-capture kernel. The following is a summary.
For i386 and x86_64:
- Use vmlinux if kernel is not relocatable.
- Use bzImage/vmlinuz if kernel is relocatable.
For ppc64:
- Use vmlinux
For ia64:
- Use vmlinux or vmlinuz.gz
For s390x:
- Use image or bzImage
For arm:
- Use zImage
For arm64:
- Use vmlinux or Image
If you are using an uncompressed vmlinux image, then use the following
command to load the dump-capture kernel::
kexec -p <dump-capture-kernel-vmlinux-image> \
--initrd=<initrd-for-dump-capture-kernel> --args-linux \
--append="root=<root-dev> <arch-specific-options>"
If you are using a compressed bzImage/vmlinuz, then use the following
command to load the dump-capture kernel::
kexec -p <dump-capture-kernel-bzImage> \
--initrd=<initrd-for-dump-capture-kernel> \
--append="root=<root-dev> <arch-specific-options>"
If you are using a compressed zImage, then use the following command
to load the dump-capture kernel::
kexec --type zImage -p <dump-capture-kernel-zImage> \
--initrd=<initrd-for-dump-capture-kernel> \
--dtb=<dtb-for-dump-capture-kernel> \
--append="root=<root-dev> <arch-specific-options>"
If you are using an uncompressed Image, then use the following command
to load the dump-capture kernel::
kexec -p <dump-capture-kernel-Image> \
--initrd=<initrd-for-dump-capture-kernel> \
--append="root=<root-dev> <arch-specific-options>"
Please note that --args-linux does not need to be specified for ia64.
It is planned to make this a no-op on that architecture, but for now
it should be omitted.
The following are the arch-specific command line options to be used
while loading the dump-capture kernel.
For i386, x86_64 and ia64:
"1 irqpoll maxcpus=1 reset_devices"
For ppc64:
"1 maxcpus=1 noirqdistrib reset_devices"
For s390x:
"1 maxcpus=1 cgroup_disable=memory"
For arm:
"1 maxcpus=1 reset_devices"
For arm64:
"1 maxcpus=1 reset_devices"
Notes on loading the dump-capture kernel:
* By default, the ELF headers are stored in ELF64 format to support
systems with more than 4GB memory. On i386, kexec automatically checks if
the physical RAM size exceeds the 4 GB limit and if not, uses ELF32.
So, on non-PAE systems, ELF32 is always used.
The --elf32-core-headers option can be used to force the generation of ELF32
headers. This is necessary because GDB currently cannot open vmcore files
with ELF64 headers on 32-bit systems.
* The "irqpoll" boot parameter reduces driver initialization failures
due to shared interrupts in the dump-capture kernel.
* You must specify <root-dev> in the format corresponding to the root
device name in the output of the mount command.
* Boot parameter "1" boots the dump-capture kernel into single-user
mode without networking. If you want networking, use "3".
* We generally don't have to bring up an SMP kernel just to capture the
dump. Hence it is generally useful either to build a UP dump-capture
kernel or to specify the maxcpus=1 option while loading the dump-capture
kernel. Note that although maxcpus always works, it is preferable to
replace it with nr_cpus to save memory where the architecture supports
it, such as x86.
* You should enable multi-CPU support in the dump-capture kernel if you
intend to use multi-threaded programs with it, such as the parallel dump
feature of makedumpfile. Otherwise, multi-threaded programs may suffer
significant performance degradation. To enable multi-CPU support, bring
up an SMP dump-capture kernel and specify the maxcpus/nr_cpus and
disable_cpu_apicid=[X] options while loading it.
* For s390x there are two kdump modes: If an ELF header is specified with
the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
is done on all other architectures. If no elfcorehdr= kernel parameter is
specified, the s390x kdump kernel dynamically creates the header. The
second mode has the advantage that for CPU and memory hotplug, kdump does
not have to be reloaded with kexec_load().
* For s390x systems with many attached devices the "cio_ignore" kernel
parameter should be used for the kdump kernel in order to prevent allocation
of kernel memory for devices that are not relevant for kdump. The same
applies to systems that use SCSI/FCP devices. In that case the
"allow_lun_scan" zfcp module parameter should be set to zero before
setting FCP devices online.
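As a sketch, the corresponding kdump kernel command-line additions could
look like the following; the cio_ignore exception list depends on which
devices the kdump kernel actually needs::

    cio_ignore=all,!ipldev,!condev zfcp.allow_lun_scan=0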
Kernel Panic
============
After successfully loading the dump-capture kernel as previously
described, the system will reboot into the dump-capture kernel if a
system crash is triggered. Trigger points are located in panic(),
die(), die_nmi() and in the sysrq handler (ALT-SysRq-c).
The following conditions will execute a crash trigger point:
If a hard lockup is detected and "NMI watchdog" is configured, the system
will boot into the dump-capture kernel ( die_nmi() ).
If die() is called, and it happens to be a thread with pid 0 or 1, or die()
is called inside interrupt context or die() is called and panic_on_oops is set,
the system will boot into the dump-capture kernel.
On powerpc systems when a soft-reset is generated, die() is called by all cpus
and the system will boot into the dump-capture kernel.
For testing purposes, you can trigger a crash by using "ALT-SysRq-c",
"echo c > /proc/sysrq-trigger" or write a module to force the panic.
Write Out the Dump File
=======================
After the dump-capture kernel is booted, write out the dump file with
the following command::
cp /proc/vmcore <dump-file>
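Alternatively, a filtered and compressed dump usually saves considerable
space and time. A minimal sketch, assuming the makedumpfile utility is
available in the dump-capture environment (the output path is an example)::

    # -c: compress with zlib; -d 31: exclude zero, cache, user and free pages
    makedumpfile -c -d 31 /proc/vmcore /var/crash/dumpfile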
Analysis
========
Before analyzing the dump image, you should reboot into a stable kernel.
You can do limited analysis using GDB on the dump file copied out of
/proc/vmcore. Use the debug vmlinux built with -g and run the following
command::
gdb vmlinux <dump-file>
Stack trace for the task on processor 0, register display, and memory
display work fine.
Note: GDB cannot analyze core files generated in ELF64 format for x86.
On systems with a maximum of 4GB of memory, you can generate
ELF32-format headers using the --elf32-core-headers kernel option on the
dump kernel.
You can also use the Crash utility to analyze dump files in Kdump
format. Crash is available on Dave Anderson's site at the following URL:
http://people.redhat.com/~anderson/
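A typical invocation mirrors the GDB example above, using the same debug
vmlinux together with the saved dump::

    crash vmlinux <dump-file>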
Trigger Kdump on WARN()
=======================
The kernel parameter panic_on_warn calls panic() in all WARN() paths. This
will cause a kdump to occur at the panic() call. If a user wants to enable
this at runtime instead, /proc/sys/kernel/panic_on_warn can be set to 1
to achieve the same behaviour.
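A minimal sketch of both approaches; the WARN()-triggering step is whatever
your test case happens to be::

    # at boot: append "panic_on_warn" to the kernel command line, or
    # at runtime:
    echo 1 > /proc/sys/kernel/panic_on_warn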
Contact
=======
- Vivek Goyal (vgoyal@redhat.com)
- Maneesh Soni (maneesh@in.ibm.com)
GDB macros
==========
.. include:: gdbmacros.txt
:literal:

View File

@@ -0,0 +1,488 @@
==========
VMCOREINFO
==========
What is it?
===========
VMCOREINFO is a special ELF note section. It contains various
information from the kernel like structure size, page size, symbol
values, field offsets, etc. These data are packed into an ELF note
section and used by user-space tools like crash and makedumpfile to
analyze a kernel's memory layout.
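As a quick sketch, the note can be inspected directly from a captured dump
with binutils; the exact output formatting depends on the readelf version::

    # list the ELF notes; the VMCOREINFO note carries KEY=VALUE lines
    readelf -n /proc/vmcore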
Common variables
================
init_uts_ns.name.release
------------------------
The version of the Linux kernel. Used to find the corresponding source
code from which the kernel has been built. For example, crash uses it to
find the corresponding vmlinux in order to process vmcore.
PAGE_SIZE
---------
The size of a page. It is the smallest unit of data used by the memory
management facilities. It is usually 4096 bytes in size, and pages are
aligned on 4096-byte boundaries. Used for computing page addresses.
init_uts_ns
-----------
The UTS namespace which is used to isolate two specific elements of the
system that relate to the uname(2) system call. It is named after the
data structure used to store information returned by the uname(2) system
call.
User-space tools can get the kernel name, host name, kernel release
number, kernel version, architecture name and OS type from it.
node_online_map
---------------
An array node_states[N_ONLINE] which represents the set of online nodes
in a system, one bit position per node number. Used to keep track of
which nodes are in the system and online.
swapper_pg_dir
--------------
The global page directory pointer of the kernel. Used to translate
virtual to physical addresses.
_stext
------
Defines the beginning of the text section. In general, _stext indicates
the kernel start address. Used to convert a virtual address from the
direct kernel map to a physical address.
vmap_area_list
--------------
Stores the virtual area list. makedumpfile gets the vmalloc start value
from this variable and its value is necessary for vmalloc translation.
mem_map
-------
Physical addresses are translated to struct pages by treating them as
an index into the mem_map array. Right-shifting a physical address
PAGE_SHIFT bits converts it into a page frame number which is an index
into that mem_map array.
Used to map an address to the corresponding struct page.
contig_page_data
----------------
Makedumpfile gets the pglist_data structure from this symbol, which is
used to describe the memory layout.
User-space tools use this to exclude free pages when dumping memory.
mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
--------------------------------------------------------------------------
The address of the mem_section array, its length, structure size, and
the section_mem_map offset.
It exists in the sparse memory mapping model, and it is also somewhat
similar to the mem_map variable; both of them are used to translate an
address.
page
----
The size of a page structure. struct page is an important data structure
and it is widely used to compute contiguous memory.
pglist_data
-----------
The size of a pglist_data structure. This value is used to check if the
pglist_data structure is valid. It is also used for checking the memory
type.
zone
----
The size of a zone structure. This value is used to check if the zone
structure has been found. It is also used for excluding free pages.
free_area
---------
The size of a free_area structure. It indicates whether the free_area
structure is valid or not. Useful when excluding free pages.
list_head
---------
The size of a list_head structure. Used when iterating lists in a
post-mortem analysis session.
nodemask_t
----------
The size of a nodemask_t type. Used to compute the number of online
nodes.
(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
-------------------------------------------------------------------------------------------------
User-space tools compute their values based on the offset of these
variables. The variables are used when excluding unnecessary pages.
(pglist_data, node_zones|nr_zones|node_mem_map|node_start_pfn|node_spanned_pages|node_id)
-----------------------------------------------------------------------------------------
On NUMA machines, each NUMA node has a pg_data_t to describe its memory
layout. On UMA machines there is a single pglist_data which describes the
whole memory.
These values are used to check the memory type and to compute the
virtual address for memory map.
(zone, free_area|vm_stat|spanned_pages)
---------------------------------------
Each node is divided into a number of blocks called zones which
represent ranges within memory. A zone is described by a structure zone.
User-space tools compute required values based on the offset of these
variables.
(free_area, free_list)
----------------------
Offset of the free_list's member. This value is used to compute the number
of free pages.
Each zone has a free_area structure array called free_area[MAX_ORDER].
The free_list represents a linked list of free page blocks.
(list_head, next|prev)
----------------------
Offsets of the list_head's members. list_head is used to define a
circular linked list. User-space tools need these in order to traverse
lists.
(vmap_area, va_start|list)
--------------------------
Offsets of the vmap_area's members. They carry vmalloc-specific
information. Makedumpfile gets the start address of the vmalloc region
from this.
(zone.free_area, MAX_ORDER)
---------------------------
Free areas descriptor. User-space tools use this value to iterate the
free_area ranges. MAX_ORDER is used by the zone buddy allocator.
log_first_idx
-------------
Index of the first record stored in the buffer log_buf. Used by
user-space tools to read the strings in the log_buf.
log_buf
-------
Console output is written to the ring buffer log_buf at index
log_first_idx. Used to get the kernel log.
log_buf_len
-----------
log_buf's length.
clear_idx
---------
The index of the first printk() record to read after the last clear
command. It indicates the first record after the last SYSLOG_ACTION_CLEAR,
as issued by 'dmesg -c'. Used by user-space tools to dump the dmesg log.
log_next_idx
------------
The index of the next record to store in the buffer log_buf. Used to
compute the index of the current buffer position.
printk_log
----------
The size of a structure printk_log. Used to compute the size of
messages, and extract dmesg log. It encapsulates header information for
log_buf, such as timestamp, syslog level, etc.
(printk_log, ts_nsec|len|text_len|dict_len)
-------------------------------------------
It represents field offsets in struct printk_log. User space tools
parse it and check whether the values of printk_log's members have been
changed.
(free_area.free_list, MIGRATE_TYPES)
------------------------------------
The number of migrate types for pages. The free_list is described by the
array. Used by tools to compute the number of free pages.
NR_FREE_PAGES
-------------
On linux-2.6.21 or later, the number of free pages is in
vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoison|PG_head_mask
------------------------------------------------------------------------------
Page attributes. These flags are used to filter out pages that are
unnecessary for dumping.
PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
-----------------------------------------------------------------------------
More page attributes. These flags are used to filter out pages that are
unnecessary for dumping.
HUGETLB_PAGE_DTOR
-----------------
The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
excludes these pages.
x86_64
======
phys_base
---------
Used to convert the virtual address of an exported kernel symbol to its
corresponding physical address.
init_top_pgt
------------
Used to walk through the whole page table and convert virtual addresses
to physical addresses. The init_top_pgt is somewhat similar to
swapper_pg_dir, but it is only used in x86_64.
pgtable_l5_enabled
------------------
User-space tools need to know whether the crash kernel was in 5-level
paging mode.
node_data
---------
This is a struct pglist_data array and stores all NUMA nodes
information. Makedumpfile gets the pglist_data structure from it.
(node_data, MAX_NUMNODES)
-------------------------
The maximum number of nodes in the system.
KERNELOFFSET
------------
The kernel randomization offset. Used to compute the page offset. If
KASLR is disabled, this value is zero.
KERNEL_IMAGE_SIZE
-----------------
Currently unused by Makedumpfile. Used to compute the module virtual
address by Crash.
sme_mask
--------
AMD-specific with SME support: it indicates the secure memory encryption
mask. Makedumpfile tools need to know whether the crash kernel was
encrypted. If SME is enabled in the first kernel, the crash kernel's
page table entries (pgd/pud/pmd/pte) contain the memory encryption
mask. This is used to remove the SME mask and obtain the true physical
address.
Currently, sme_mask stores the value of the C-bit position. If needed,
additional SME-relevant info can be placed in that variable.
For example::
[ misc ][ enc bit ][ other misc SME info ]
0000_0000_0000_0000_1000_0000_0000_0000_0000_0000_..._0000
63 59 55 51 47 43 39 35 31 27 ... 3
x86_32
======
X86_PAE
-------
Denotes whether physical address extensions are enabled. It has the cost
of a higher page table lookup overhead, and also consumes more page
table space per process. Used to check whether PAE was enabled in the
crash kernel when converting virtual addresses to physical addresses.
ia64
====
pgdat_list|(pgdat_list, MAX_NUMNODES)
-------------------------------------
pg_data_t array storing all NUMA nodes information. MAX_NUMNODES
indicates the number of the nodes.
node_memblk|(node_memblk, NR_NODE_MEMBLKS)
------------------------------------------
List of node memory chunks. Filled when parsing the SRAT table to obtain
information about memory nodes. NR_NODE_MEMBLKS indicates the number of
node memory chunks.
These values are used to compute the number of nodes the crashed kernel used.
node_memblk_s|(node_memblk_s, start_paddr)|(node_memblk_s, size)
----------------------------------------------------------------
The size of a struct node_memblk_s and the offsets of the
node_memblk_s's members. Used to compute the number of nodes.
PGTABLE_3|PGTABLE_4
-------------------
User-space tools need to know whether the crash kernel was in 3-level or
4-level paging mode. Used to distinguish the page table.
ARM64
=====
VA_BITS
-------
The maximum number of bits for virtual addresses. Used to compute the
virtual memory ranges.
kimage_voffset
--------------
The offset between the kernel virtual and physical mappings. Used to
translate virtual to physical addresses.
PHYS_OFFSET
-----------
Indicates the physical address of the start of memory. Similar to
kimage_voffset, which is used to translate virtual to physical
addresses.
KERNELOFFSET
------------
The kernel randomization offset. Used to compute the page offset. If
KASLR is disabled, this value is zero.
arm
===
ARM_LPAE
--------
It indicates whether the crash kernel supports large physical address
extensions. Used to translate virtual to physical addresses.
s390
====
lowcore_ptr
-----------
An array with a pointer to the lowcore of every CPU. Used to print the
PSW and all register information.
high_memory
-----------
Used to get the vmalloc_start address from the high_memory symbol.
(lowcore_ptr, NR_CPUS)
----------------------
The maximum number of CPUs.
powerpc
=======
node_data|(node_data, MAX_NUMNODES)
-----------------------------------
See above.
contig_page_data
----------------
See above.
vmemmap_list
------------
The vmemmap_list maintains the entire vmemmap physical mapping. Used
to get vmemmap list count and populated vmemmap regions info. If the
vmemmap address translation information is stored in the crash kernel,
it is used to translate vmemmap kernel virtual addresses.
mmu_vmemmap_psize
-----------------
The size of a page. Used to translate virtual to physical addresses.
mmu_psize_defs
--------------
Page size definitions, i.e. 4k, 64k, or 16M.
Used to make vtop translations.
vmemmap_backing|(vmemmap_backing, list)|(vmemmap_backing, phys)|(vmemmap_backing, virt_addr)
--------------------------------------------------------------------------------------------
The vmemmap virtual address space management does not have a traditional
page table to track which virtual struct pages are backed by a physical
mapping. The virtual to physical mappings are tracked in a simple linked
list format.
User-space tools need to know the offset of list, phys and virt_addr
when computing the count of vmemmap regions.
mmu_psize_def|(mmu_psize_def, shift)
------------------------------------
The size of a struct mmu_psize_def and the offset of mmu_psize_def's
member.
Used in vtop translations.
sh
==
node_data|(node_data, MAX_NUMNODES)
-----------------------------------
See above.
X2TLB
-----
Indicates whether the crashed kernel enabled SH extended mode.

View File

@@ -9,11 +9,11 @@ and sorted into English Dictionary order (defined as ignoring all
punctuation and sorting digits before letters in a case insensitive
manner), and with descriptions where known.
The kernel parses parameters from the kernel command line up to "--";
The kernel parses parameters from the kernel command line up to "``--``";
if it doesn't recognize a parameter and it doesn't contain a '.', the
parameter gets passed to init: parameters with '=' go into init's
environment, others are passed as command line arguments to init.
Everything after "--" is passed as an argument to init.
Everything after "``--``" is passed as an argument to init.
Module parameters can be specified in two ways: via the kernel command
line with a module name prefix, or via modprobe, e.g.::
@@ -118,7 +118,7 @@ parameter is applicable::
LOOP Loopback device support is enabled.
M68k M68k architecture is enabled.
These options have more detailed description inside of
Documentation/m68k/kernel-options.rst.
MDA MDA console support is enabled.
MIPS MIPS architecture is enabled.
MOUSE Appropriate mouse support is enabled.
@@ -167,7 +167,7 @@ parameter is applicable::
X86-32 X86-32, aka i386 architecture is enabled.
X86-64 X86-64 architecture is enabled.
More X86-64 boot options can be found in
Documentation/x86/x86_64/boot-options.rst.
X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64)
X86_UV SGI UV support is enabled.
XEN Xen support is enabled
@@ -181,10 +181,10 @@ In addition, the following text indicates that the option::
Parameters denoted with BOOT are actually interpreted by the boot
loader, and have no meaning to the kernel directly.
Do not modify the syntax of boot loader parameters without extreme
need or coordination with <Documentation/x86/boot.rst>.
There are also arch-specific kernel-parameters not documented here.
See for example <Documentation/x86/x86_64/boot-options.rst>.
Note that ALL kernel parameters listed below are CASE SENSITIVE, and that
a trailing = on the name of any parameter states that that parameter will

View File

@@ -13,7 +13,7 @@
For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force"
are available
See also Documentation/power/runtime_pm.rst, pci=noacpi
acpi_apic_instance= [ACPI, IOAPIC]
Format: <int>
@@ -53,7 +53,7 @@
ACPI_DEBUG_PRINT statements, e.g.,
ACPI_DEBUG_PRINT((ACPI_DB_INFO, ...
The debug_level mask defaults to "info". See
Documentation/firmware-guide/acpi/debug.rst for more information about
debug layers and levels.
Enable processor driver info messages:
@@ -223,7 +223,7 @@
acpi_sleep= [HW,ACPI] Sleep options
Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
old_ordering, nonvs, sci_force_enable, nobl }
See Documentation/power/video.rst for information on
s3_bios and s3_mode.
s3_beep is for debugging; it makes the PC's speaker beep
as soon as the kernel's real-mode entry point is called.
@@ -430,7 +430,7 @@
blkdevparts= Manual partition parsing of block device(s) for
embedded devices based on command line input.
See Documentation/block/cmdline-partition.rst
boot_delay= Milliseconds to delay each printk during boot.
Values larger than 10 seconds (10000) are changed to
@@ -708,14 +708,14 @@
[KNL, x86_64] select a region under 4G first, and
fall back to reserve region above 4G when '@offset'
hasn't been specified.
See Documentation/admin-guide/kdump/kdump.rst for further details.
crashkernel=range1:size1[,range2:size2,...][@offset]
[KNL] Same as above, but depends on the memory
in the running system. The syntax of range is
start-[end] where start and end are both
a memory unit (amount[KMG]). See also
Documentation/admin-guide/kdump/kdump.rst for an example.
crashkernel=size[KMG],high
[KNL, x86_64] range could be above 4G. Allow kernel
@@ -805,12 +805,10 @@
tracking down these problems.
debug_pagealloc=
[KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter
enables the feature at boot time. By default, it is
disabled and the system will work mostly the same as a
kernel built without CONFIG_DEBUG_PAGEALLOC.
on: enable the feature
debugpat [X86] Enable PAT debugging
@@ -932,7 +930,7 @@
edid/1680x1050.bin, or edid/1920x1080.bin is given
and no file with the same name exists. Details and
instructions how to build your own EDID data are
available in Documentation/driver-api/edid.rst. An EDID
data set will only be used for a particular connector,
if its name and a colon are prepended to the EDID
name. Each connector may use a unique EDID data
@@ -963,7 +961,7 @@
for details.
nompx [X86] Disables Intel Memory Protection Extensions.
See Documentation/x86/intel_mpx.rst for more
information about the feature.
nopku [X86] Disable Memory Protection Keys CPU feature found
@@ -1189,7 +1187,7 @@
that is to be dynamically loaded by Linux. If there are
multiple variables with the same name but with different
vendor GUIDs, all of them will be loaded. See
Documentation/admin-guide/acpi/ssdt-overlays.rst for details.
eisa_irq_edge= [PARISC,HW]
@@ -1201,15 +1199,15 @@
elevator= [IOSCHED]
Format: { "mq-deadline" | "kyber" | "bfq" }
See Documentation/block/deadline-iosched.rst,
Documentation/block/kyber-iosched.rst and
Documentation/block/bfq-iosched.rst for details.
elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
Specifies physical address of start of kernel core
image elf header and optionally the size. Generally
kexec loader will pass this option to capture kernel.
See Documentation/admin-guide/kdump/kdump.rst for details.
enable_mtrr_cleanup [X86]
The kernel tries to adjust MTRR layout from continuous
@@ -1251,7 +1249,7 @@
See also Documentation/fault-injection/.
floppy= [HW]
See Documentation/admin-guide/blockdev/floppy.rst.
force_pal_cache_flush
[IA-64] Avoid check_sal_cache_flush which may hang on
@@ -1388,9 +1386,6 @@
Valid parameters: "on", "off"
Default: "on"
hlt [BUGS=ARM,SH]
hpet= [X86-32,HPET] option to control HPET usage
@@ -1507,7 +1502,7 @@
Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc
.vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr
.cdrom .chs .ignore_cable are additional options
See Documentation/ide/ide.rst.
ide-generic.probe-mask= [HW] (E)IDE subsystem
Format: <int>
@@ -1673,6 +1668,15 @@
initrd= [BOOT] Specify the location of the initial ramdisk
init_on_alloc= [MM] Fill newly allocated pages and heap objects with
zeroes.
Format: 0 | 1
Default set by CONFIG_INIT_ON_ALLOC_DEFAULT_ON.
init_on_free= [MM] Fill freed pages and heap objects with zeroes.
Format: 0 | 1
Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON.
init_pkru= [x86] Specify the default memory protection keys rights
register contents for all processes. 0x55555554 by
default (disallow access to all but pkey 0). Can
@@ -2007,6 +2011,19 @@
Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,
the default is off.
kprobe_event=[probe-list]
[FTRACE] Add kprobe events and enable at boot time.
The probe-list is a semicolon delimited list of probe
definitions. Each definition is same as kprobe_events
interface, but the parameters are comma delimited.
For example, to add a kprobe event on vfs_read with
arg1 and arg2, add to the command line;
kprobe_event=p,vfs_read,$arg1,$arg2
See also Documentation/trace/kprobetrace.rst "Kernel
Boot Parameter" section.
kpti= [ARM64] Control page table isolation of user
and kernel address spaces.
Default: enabled on cores which need mitigation.
@@ -2230,7 +2247,7 @@
memblock=debug [KNL] Enable memblock debug messages.
load_ramdisk= [RAM] List of ramdisks to load from floppy
See Documentation/admin-guide/blockdev/ramdisk.rst.
lockd.nlm_grace_period=P [NFS] Assign grace period.
Format: <integer>
@@ -2383,7 +2400,7 @@
mce [X86-32] Machine Check Exception
mce=option [X86-64] See Documentation/x86/x86_64/boot-options.rst
md= [HW] RAID subsystems devices and level
See Documentation/admin-guide/md.rst.
@@ -2439,7 +2456,7 @@
set according to the
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
option.
See Documentation/admin-guide/mm/memory-hotplug.rst.
memmap=exactmap [KNL,X86] Enable setting of an exact
E820 memory map, as specified by the user.
@@ -2528,7 +2545,7 @@
mem_encrypt=on: Activate SME
mem_encrypt=off: Do not activate SME
Refer to Documentation/virt/kvm/amd-memory-encryption.rst
for details on when memory encryption can be activated.
mem_sleep_default= [SUSPEND] Default system suspend mode:
@@ -2836,8 +2853,9 @@
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
When panic is specified, panic when an NMI watchdog
timeout occurs (or 'nopanic' to not panic on an NMI
watchdog, if CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is set)
To disable both hard and soft lockup detectors,
please see 'nowatchdog'.
This is useful when you use a panic=... timeout and
need the box quickly up again.
@@ -2872,6 +2890,17 @@
/sys/module/printk/parameters/console_suspend) to
turn on/off it dynamically.
novmcoredd [KNL,KDUMP]
Disable device dump. Device dump allows drivers to
append dump data to vmcore so you can collect driver
specified debug info. Drivers can append the data
without any limit and this data is stored in memory,
so this may cause significant memory stress. Disabling
device dump can help save memory, but the driver debug
data will no longer be available. This parameter
is only available when CONFIG_PROC_VMCORE_DEVICE_DUMP
is set.
noaliencache [MM, NUMA, SLAB] Disables the allocation of alien
caches in the slab allocator. Saves per-node memory,
but will impact performance.
@@ -2927,7 +2956,7 @@
register save and restore. The kernel will only save
legacy floating-point registers on task switch.
nohugeiomap [KNL,x86,PPC] Disable kernel huge I/O mappings.
nosmt [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
@@ -3139,7 +3168,7 @@
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
'node', 'default' can be specified
This can be set from sysctl after boot.
See Documentation/admin-guide/sysctl/vm.rst for details.
ohci1394_dma=early [HW] enable debugging via the ohci1394 driver.
See Documentation/debugging-via-ohci1394.txt for more
@@ -3263,7 +3292,7 @@
pcd. [PARIDE]
See header of drivers/block/paride/pcd.c.
See also Documentation/admin-guide/blockdev/paride.rst.
pci=option[,option...] [PCI] various PCI subsystem options.
@@ -3507,7 +3536,7 @@
needed on a platform with proper driver support.
pd. [PARIDE]
See Documentation/admin-guide/blockdev/paride.rst.
pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at
boot time.
@@ -3522,13 +3551,13 @@
and performance comparison.
pf. [PARIDE]
See Documentation/admin-guide/blockdev/paride.rst.
pg. [PARIDE]
See Documentation/admin-guide/blockdev/paride.rst.
pirq= [SMP,APIC] Manual mp-table setup
See Documentation/x86/i386/IO-APIC.rst.
plip= [PPT,NET] Parallel port network link
Format: { parport<nr> | timid | 0 }
@@ -3637,7 +3666,7 @@
prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk
before loading.
See Documentation/admin-guide/blockdev/ramdisk.rst.
psi= [KNL] Enable or disable pressure stall information
tracking.
@@ -3659,7 +3688,7 @@
pstore.backend= Specify the name of the pstore backend to use
pt. [PARIDE]
See Documentation/admin-guide/blockdev/paride.rst.
pti= [X86_64] Control Page Table Isolation of user and
kernel address spaces. Disabling this feature
@@ -3688,7 +3717,7 @@
See Documentation/admin-guide/md.rst.
ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
See Documentation/admin-guide/blockdev/ramdisk.rst.
random.trust_cpu={on,off}
[KNL] Enable or disable trusting the use of the
@@ -4084,7 +4113,7 @@
relax_domain_level=
[KNL, SMP] Set scheduler's default relax_domain_level.
See Documentation/admin-guide/cgroup-v1/cpusets.rst.
reserve= [KNL,BUGS] Force kernel to ignore I/O ports or memory
Format: <base1>,<size1>[,<base2>,<size2>,...]
@@ -4114,7 +4143,7 @@
Specify the offset from the beginning of the partition
given by "resume=" at which the swap header is located,
in <PAGE_SIZE> units (needed only for swap files).
See Documentation/power/swsusp-and-swap-files.rst
resumedelay= [HIBERNATION] Delay (in seconds) to pause before attempting to
read the resume files
@@ -4342,7 +4371,7 @@
Format: <integer>
sonypi.*= [HW] Sony Programmable I/O Control Device driver
See Documentation/admin-guide/laptops/sonypi.rst
spectre_v2= [X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.
@@ -4594,7 +4623,7 @@
swapaccount=[0|1]
[KNL] Enable accounting of swap in memory resource
controller if no parameter or 1 is given or disable
it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
swiotlb= [ARM,IA-64,PPC,MIPS,X86]
Format: { <int> | force | noforce }
@@ -4669,27 +4698,6 @@
Force threading of all interrupt handlers except those
marked explicitly IRQF_NO_THREAD.
topology= [S390]
Format: {off | on}
Specify if the kernel should make use of the cpu
@@ -5032,7 +5040,7 @@
vector=percpu: enable percpu vector domain
video= [FB] Frame buffer configuration
See Documentation/fb/modedb.rst.
video.brightness_switch_enabled= [0,1]
If set to 1, on receiving an ACPI notify event
@@ -5060,8 +5068,8 @@
Can be used multiple times for multiple devices.
vga= [BOOT,X86-32] Select a particular video mode
See Documentation/x86/boot.rst and
Documentation/admin-guide/svga.rst.
Use vga=ask for menu.
This is actually a boot loader parameter; the value is
passed to the kernel using a special protocol.
@@ -5167,7 +5175,7 @@
Default: 3 = cyan.
watchdog timers [HW,WDT] For information on watchdog timers,
see Documentation/watchdog/watchdog-parameters.rst
or other driver-specific files in the
Documentation/watchdog/ directory.
@@ -5259,6 +5267,8 @@
xen_nopv [X86]
Disables the PV optimizations forcing the HVM guest to
run as generic HVM guest with no PV drivers.
This option is obsoleted by the "nopv" option, which
has equivalent effect for XEN platform.
xen_scrub_pages= [XEN]
Boolean option to control scrubbing pages before giving them back
@@ -5273,10 +5283,24 @@
improve timer resolution at the expense of processing
more timer interrupts.
nopv= [X86,XEN,KVM,HYPER_V,VMWARE]
Disables the PV optimizations, forcing the guest to run
as a generic guest with no PV drivers. Currently supported
for XEN HVM, KVM, HYPER_V and VMWARE guests.
xirc2ps_cs= [NET,PCMCIA]
Format:
<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]
xive= [PPC]
By default on POWER9 and above, the kernel will
natively use the XIVE interrupt controller. This option
allows the fallback firmware mode to be used:
off Fallback to firmware control of XIVE interrupt
controller on both pseries and powernv
platforms. Only useful on POWER9 and above.
xhci-hcd.quirks [USB,KNL]
A hex value specifying bitmask with supplemental xhci
host controller quirks. Meaning of each bit can be

View File

@@ -0,0 +1,356 @@
==========================================
Reducing OS jitter due to per-cpu kthreads
==========================================
This document lists per-CPU kthreads in the Linux kernel and presents
options to control their OS jitter. Note that non-per-CPU kthreads are
not listed here. To reduce OS jitter from non-per-CPU kthreads, bind
them to a "housekeeping" CPU dedicated to such work.
References
==========
- Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs.
- Documentation/admin-guide/cgroup-v1: Using cgroups to bind tasks to sets of CPUs.
- man taskset: Using the taskset command to bind tasks to sets
of CPUs.
- man sched_setaffinity: Using the sched_setaffinity() system
call to bind tasks to sets of CPUs.
- /sys/devices/system/cpu/cpuN/online: Control CPU N's hotplug state,
writing "0" to offline and "1" to online.
- In order to locate kernel-generated OS jitter on CPU N:
cd /sys/kernel/debug/tracing
echo 1 > max_graph_depth # Increase the "1" for more detail
echo function_graph > current_tracer
# run workload
cat per_cpu/cpuN/trace
kthreads
========
Name:
ehca_comp/%u
Purpose:
Periodically process Infiniband-related work.
To reduce its OS jitter, do any of the following:
1. Don't use eHCA Infiniband hardware, instead choosing hardware
that does not require per-CPU kthreads. This will prevent these
kthreads from being created in the first place. (This will
work for most people, as this hardware, though important, is
relatively old and is produced in relatively low unit volumes.)
2. Do all eHCA-Infiniband-related work on other CPUs, including
interrupts.
3. Rework the eHCA driver so that its per-CPU kthreads are
provisioned only on selected CPUs.
Name:
irq/%d-%s
Purpose:
Handle threaded interrupts.
To reduce its OS jitter, do the following:
1. Use irq affinity to force the irq threads to execute on
some other CPU.
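A minimal sketch; the IRQ number 24 is hypothetical, and the value written
is a hexadecimal CPU bitmask::

    # move IRQ 24 -- and with it its irq/24-* handler thread -- to CPUs 2-3
    echo c > /proc/irq/24/smp_affinity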
Name:
kcmtpd_ctr_%d
Purpose:
Handle Bluetooth work.
To reduce its OS jitter, do one of the following:
1. Don't use Bluetooth, in which case these kthreads won't be
created in the first place.
2. Use irq affinity to force Bluetooth-related interrupts to
occur on some other CPU and furthermore initiate all
Bluetooth activity on some other CPU.
Name:
ksoftirqd/%u
Purpose:
Execute softirq handlers when threaded or when under heavy load.
To reduce its OS jitter, each softirq vector must be handled
separately as follows:
TIMER_SOFTIRQ
-------------
Do all of the following:
1. To the extent possible, keep the CPU out of the kernel when it
is non-idle, for example, by avoiding system calls and by forcing
both kernel threads and interrupts to execute elsewhere.
2. Build with CONFIG_HOTPLUG_CPU=y. After boot completes, force
the CPU offline, then bring it back online. This forces
recurring timers to migrate elsewhere. If you are concerned
with multiple CPUs, force them all offline before bringing the
first one back online. Once you have onlined the CPUs in question,
do not offline any other CPUs, because doing so could force the
timer back onto one of the CPUs in question.
NET_TX_SOFTIRQ and NET_RX_SOFTIRQ
---------------------------------
Do all of the following:
1. Force networking interrupts onto other CPUs.
2. Initiate any network I/O on other CPUs.
3. Once your application has started, prevent CPU-hotplug operations
from being initiated from tasks that might run on the CPU to
be de-jittered. (It is OK to force this CPU offline and then
bring it back online before you start your application.)
BLOCK_SOFTIRQ
-------------
Do all of the following:
1. Force block-device interrupts onto some other CPU.
2. Initiate any block I/O on other CPUs.
3. Once your application has started, prevent CPU-hotplug operations
from being initiated from tasks that might run on the CPU to
be de-jittered. (It is OK to force this CPU offline and then
bring it back online before you start your application.)
IRQ_POLL_SOFTIRQ
----------------
Do all of the following:
1. Force block-device interrupts onto some other CPU.
2. Initiate any block I/O and block-I/O polling on other CPUs.
3. Once your application has started, prevent CPU-hotplug operations
from being initiated from tasks that might run on the CPU to
be de-jittered. (It is OK to force this CPU offline and then
bring it back online before you start your application.)
TASKLET_SOFTIRQ
---------------
Do one or more of the following:
1. Avoid use of drivers that use tasklets. (Such drivers will contain
calls to things like tasklet_schedule().)
2. Convert all drivers that you must use from tasklets to workqueues.
3. Force interrupts for drivers using tasklets onto other CPUs,
and also do I/O involving these drivers on other CPUs.
SCHED_SOFTIRQ
-------------
Do all of the following:
1. Avoid sending scheduler IPIs to the CPU to be de-jittered,
for example, ensure that at most one runnable kthread is present
on that CPU. If a thread that expects to run on the de-jittered
CPU awakens, the scheduler will send an IPI that can result in
a subsequent SCHED_SOFTIRQ.
2. CONFIG_NO_HZ_FULL=y and ensure that the CPU to be de-jittered
is marked as an adaptive-ticks CPU using the "nohz_full="
boot parameter. This reduces the number of scheduler-clock
interrupts that the de-jittered CPU receives, minimizing its
chances of being selected to do the load balancing work that
runs in SCHED_SOFTIRQ context.
3. To the extent possible, keep the CPU out of the kernel when it
is non-idle, for example, by avoiding system calls and by
forcing both kernel threads and interrupts to execute elsewhere.
This further reduces the number of scheduler-clock interrupts
received by the de-jittered CPU.
HRTIMER_SOFTIRQ
---------------
Do all of the following:
1. To the extent possible, keep the CPU out of the kernel when it
is non-idle. For example, avoid system calls and force both
kernel threads and interrupts to execute elsewhere.
2. Build with CONFIG_HOTPLUG_CPU=y. Once boot completes, force the
CPU offline, then bring it back online. This forces recurring
timers to migrate elsewhere. If you are concerned with multiple
CPUs, force them all offline before bringing the first one
back online. Once you have onlined the CPUs in question, do not
offline any other CPUs, because doing so could force the timer
back onto one of the CPUs in question.
RCU_SOFTIRQ
-----------
Do at least one of the following:
1. Offload callbacks and keep the CPU in either dyntick-idle or
adaptive-ticks state by doing all of the following:
a. CONFIG_NO_HZ_FULL=y and ensure that the CPU to be
de-jittered is marked as an adaptive-ticks CPU using the
"nohz_full=" boot parameter. Bind the rcuo kthreads to
housekeeping CPUs, which can tolerate OS jitter.
b. To the extent possible, keep the CPU out of the kernel
when it is non-idle, for example, by avoiding system
calls and by forcing both kernel threads and interrupts
to execute elsewhere.
2. Enable RCU to do its processing remotely via dyntick-idle by
doing all of the following:
a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y.
b. Ensure that the CPU goes idle frequently, allowing other
CPUs to detect that it has passed through an RCU quiescent
state. If the kernel is built with CONFIG_NO_HZ_FULL=y,
userspace execution also allows other CPUs to detect that
the CPU in question has passed through a quiescent state.
c. To the extent possible, keep the CPU out of the kernel
when it is non-idle, for example, by avoiding system
calls and by forcing both kernel threads and interrupts
to execute elsewhere.
Name:
kworker/%u:%d%s (cpu, id, priority)
Purpose:
Execute workqueue requests
To reduce its OS jitter, do any of the following:
1. Run your workload at a real-time priority, which will allow
preempting the kworker daemons.
2. A given workqueue can be made visible in the sysfs filesystem
by passing the WQ_SYSFS flag to that workqueue's alloc_workqueue().
Such a workqueue can be confined to a given subset of the
CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs
files (see the sketch after this list). The set of WQ_SYSFS
workqueues can be displayed using "ls /sys/devices/virtual/workqueue".
That said, the workqueues maintainer would like to caution people
against indiscriminately sprinkling WQ_SYSFS across all the workqueues.
The reason for caution is that it is easy to add WQ_SYSFS, but because
sysfs is part of the formal user/kernel API, it can be nearly impossible
to remove it, even if its addition was a mistake.
3. Do any of the following needed to avoid jitter that your
application cannot tolerate:
a. Build your kernel with CONFIG_SLUB=y rather than
CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
use of each CPU's workqueues to run its cache_reap()
function.
b. Avoid using oprofile, thus avoiding OS jitter from
wq_sync_buffer().
c. Limit your CPU frequency so that a CPU-frequency
governor is not required, possibly enlisting the aid of
special heatsinks or other cooling technologies. If done
correctly, and if your CPU architecture permits, you should
be able to build your kernel with CONFIG_CPU_FREQ=n to
avoid the CPU-frequency governor periodically running
on each CPU, including cs_dbs_timer() and od_dbs_timer().
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
d. As of v3.18, Christoph Lameter's on-demand vmstat workers
commit prevents OS jitter due to vmstat_update() on
CONFIG_SMP=y systems. Before v3.18, it is not possible
to entirely get rid of the OS jitter, but you can
decrease its frequency by writing a large value to
/proc/sys/vm/stat_interval. The default value is HZ,
for an interval of one second. Of course, larger values
will make your virtual-memory statistics update more
slowly. Alternatively, you can run your workload at
a real-time priority, thus preempting vmstat_update(),
but if your workload is CPU-bound, this is a bad idea.
However, there is an RFC patch from Christoph Lameter
(based on an earlier one from Gilad Ben-Yossef) that
reduces or even eliminates vmstat overhead for some
workloads at https://lkml.org/lkml/2013/9/4/379.
e. Boot with "elevator=noop" to avoid workqueue use by
the block layer.
f. If running on high-end powerpc servers, build with
CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
daemon from running on each CPU every second or so.
(This will require editing Kconfig files and will defeat
this platform's RAS functionality.) This avoids jitter
due to the rtas_event_scan() function.
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
g. If running on Cell Processor, build your kernel with
CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
spu_gov_work().
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
h. If running on PowerMAC, build your kernel with
CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
avoiding OS jitter from rackmeter_do_timer().
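As a sketch of the workqueue confinement mentioned in item 2 above; the
"writeback" workqueue is used here as an example, and which WQ_SYSFS
workqueues exist depends on the running kernel::

    # restrict the workqueue to CPUs 0 and 1 (hexadecimal mask 0x3)
    echo 3 > /sys/devices/virtual/workqueue/writeback/cpumask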
Name:
rcuc/%u
Purpose:
Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.
To reduce its OS jitter, do at least one of the following:
1. Build the kernel with CONFIG_PREEMPT=n. This prevents these
kthreads from being created in the first place, and also obviates
the need for RCU priority boosting. This approach is feasible
for workloads that do not require high degrees of responsiveness.
2. Build the kernel with CONFIG_RCU_BOOST=n. This prevents these
kthreads from being created in the first place. This approach
is feasible only if your workload never requires RCU priority
boosting, for example, if you ensure frequent idle time on all
CPUs that might execute within the kernel.
3. Build with CONFIG_RCU_NOCB_CPU=y and boot with the rcu_nocbs=
boot parameter offloading RCU callbacks from all CPUs susceptible
to OS jitter. This approach prevents the rcuc/%u kthreads from
having any work to do, so that they are never awakened.
4. Ensure that the CPU never enters the kernel, and, in particular,
avoid initiating any CPU hotplug operations on this CPU. This is
another way of preventing any callbacks from being queued on the
CPU, again preventing the rcuc/%u kthreads from having any work
to do.
Name:
rcuop/%d and rcuos/%d
Purpose:
Offload RCU callbacks from the corresponding CPU.
To reduce its OS jitter, do at least one of the following:
1. Use affinity, cgroups, or other mechanism to force these kthreads
to execute on some other CPU.
2. Build with CONFIG_RCU_NOCB_CPU=n, which will prevent these
kthreads from being created in the first place. However, please
note that this will not eliminate OS jitter, but will instead
shift it to RCU_SOFTIRQ.
Name:
watchdog/%u
Purpose:
Detect software lockups on each CPU.
To reduce its OS jitter, do at least one of the following:
1. Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these
kthreads from being created in the first place.
2. Boot with "nosoftlockup=0", which will also prevent these kthreads
from being created. Other related watchdog and softlockup boot
parameters may be found in Documentation/admin-guide/kernel-parameters.rst
and Documentation/watchdog/watchdog-parameters.rst.
3. Echo a zero to /proc/sys/kernel/watchdog to disable the
watchdog timer.
4. Echo a large number to /proc/sys/kernel/watchdog_thresh in
order to reduce the frequency of OS jitter due to the watchdog
timer down to a level that is acceptable for your workload.

View File

@@ -0,0 +1,271 @@
==================
Asus Laptop Extras
==================
Version 0.1
August 6, 2009
Corentin Chary <corentincj@iksaif.net>
http://acpi4asus.sf.net/
This driver provides support for extra features of ACPI-compatible ASUS laptops.
It may also support some MEDION, JVC or VICTOR laptops (such as MEDION 9675 or
VICTOR XP7210 for example). It makes all the extra buttons generate input
events (like keyboards).
On some models it adds support for changing the display brightness and
output, switching the LCD backlight on and off, and, most importantly,
allows you to blink those fancy LEDs intended for reporting mail and
wireless status.
This driver supersedes the old asus_acpi driver.
Requirements
------------
Kernel 2.6.X sources, configured for your computer, with ACPI support.
You also need CONFIG_INPUT and CONFIG_ACPI.
Status
------
The features currently supported are the following (see below for
detailed description):
- Fn key combinations
- Bluetooth enable and disable
- Wlan enable and disable
- GPS enable and disable
- Video output switching
- Ambient Light Sensor on and off
- LED control
- LED Display control
- LCD brightness control
- LCD on and off
A compatibility table by model and feature is maintained on the web
site, http://acpi4asus.sf.net/.
Usage
-----
Try "modprobe asus-laptop". Check your dmesg (simply type dmesg). You should
see some lines like this :
Asus Laptop Extras version 0.42
- L2D model detected.
If it is not the output you have on your laptop, send it (and the laptop's
DSDT) to me.
That's all, now, all the events generated by the hotkeys of your laptop
should be reported via netlink events. You can check with
"acpi_genl monitor" (part of the acpica project).
Hotkeys are also reported as input keys (as on keyboards); you can check
which keys are supported using "xev" under X11.
You can get information on the version of your DSDT table by reading the
/sys/devices/platform/asus-laptop/infos entry. If you have a question or
a bug report, please include the output of this entry.
LEDs
----
You can modify LEDs by echoing values to `/sys/class/leds/asus/*/brightness`::
echo 1 > /sys/class/leds/asus::mail/brightness
will switch the mail LED on.
You can also check whether they are on or off by reading their contents,
and you can use kernel triggers like disk-activity or heartbeat.
Backlight
---------
You can control LCD backlight power and brightness with
/sys/class/backlight/asus-laptop/. Brightness values range from 0 to 15.
Wireless devices
----------------
You can turn the internal Bluetooth adapter on/off with the bluetooth entry
(only on models with Bluetooth). This usually controls the associated LED.
The same applies to the WLAN adapter.
Display switching
-----------------
Note: the display switching code is currently considered EXPERIMENTAL.
Switching works for the following models:
- L3800C
- A2500H
- L5800C
- M5200N
- W1000N (albeit with some glitches)
- M6700R
- A6JC
- F3J
Switching doesn't work for the following:
- M3700N
- L2X00D (locks the laptop under certain conditions)
To switch the displays, echo values from 0 to 15 to
/sys/devices/platform/asus-laptop/display. The significance of those values
is as follows:
+-------+-----+-----+-----+-----+-----+
| Bin | Val | DVI | TV | CRT | LCD |
+-------+-----+-----+-----+-----+-----+
| 0000 | 0 | | | | |
+-------+-----+-----+-----+-----+-----+
| 0001 | 1 | | | | X |
+-------+-----+-----+-----+-----+-----+
| 0010 | 2 | | | X | |
+-------+-----+-----+-----+-----+-----+
| 0011 | 3 | | | X | X |
+-------+-----+-----+-----+-----+-----+
| 0100 | 4 | | X | | |
+-------+-----+-----+-----+-----+-----+
| 0101 | 5 | | X | | X |
+-------+-----+-----+-----+-----+-----+
| 0110 | 6 | | X | X | |
+-------+-----+-----+-----+-----+-----+
| 0111 | 7 | | X | X | X |
+-------+-----+-----+-----+-----+-----+
| 1000 | 8 | X | | | |
+-------+-----+-----+-----+-----+-----+
| 1001 | 9 | X | | | X |
+-------+-----+-----+-----+-----+-----+
| 1010 | 10 | X | | X | |
+-------+-----+-----+-----+-----+-----+
| 1011 | 11 | X | | X | X |
+-------+-----+-----+-----+-----+-----+
| 1100 | 12 | X | X | | |
+-------+-----+-----+-----+-----+-----+
| 1101 | 13 | X | X | | X |
+-------+-----+-----+-----+-----+-----+
| 1110 | 14 | X | X | X | |
+-------+-----+-----+-----+-----+-----+
| 1111 | 15 | X | X | X | X |
+-------+-----+-----+-----+-----+-----+
In most cases, the appropriate displays must be plugged in for the above
combinations to work. TV-Out may need to be initialized at boot time.
Debugging:
1) Check whether the Fn+F8 key:
a) does not lock the laptop (try a boot with noapic / nolapic if it does)
b) generates events (0x6n, where n is the value corresponding to the
configuration above)
c) actually works
Record the disp value at every configuration.
2) Echo values from 0 to 15 to /sys/devices/platform/asus-laptop/display.
Record its value, note any change. If nothing changes, try a broader range,
up to 65535.
3) Send ANY output (both positive and negative reports are needed, unless your
machine is already listed above) to the acpi4asus-user mailing list.
Note: on some machines (e.g. L3C), after the module has been loaded, only 0x6n
events are generated and no actual switching occurs. In such a case, a line
like::
echo $((10#$arg-60)) > /sys/devices/platform/asus-laptop/display
will usually do the trick ($arg is the 0000006n-like event passed to acpid).
Note: there is currently no reliable way to read display status on xxN
(Centrino) models.
LED display
-----------
Some models like the W1N have a LED display that can be used to display
several items of information.
LED display works for the following models:
- W1000N
- W1J
To control the LED display, use the following::
echo 0x0T000DDD > /sys/devices/platform/asus-laptop/ledd
where T controls the 3-letter display and DDD the 3-digit display,
according to the tables below::
DDD (digits)
000 to 999 = display digits
AAA = ---
BBB to FFF = turn-off
T (type)
0 = off
1 = dvd
2 = vcd
3 = mp3
4 = cd
5 = tv
6 = cpu
7 = vol
For example "echo 0x01000001 >/sys/devices/platform/asus-laptop/ledd"
would display "DVD001".
Driver options
--------------
Options can be passed to the asus-laptop driver using the standard
module argument syntax (<param>=<value> when passing the option to the
module or asus-laptop.<param>=<value> on the kernel boot line when
asus-laptop is statically linked into the kernel).
wapf: WAPF defines the behavior of the Fn+Fx wlan key
The significance of values is yet to be found, but
most of the time:
- 0x0 should do nothing
- 0x1 should allow you to control the device with the Fn+Fx key.
- 0x4 should send an ACPI event (0x88) while pressing the Fn+Fx key
- 0x5 like 0x1 or 0x4
The default value is 0x1.
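For example, to get the ACPI-event behavior, the module could be loaded with
the corresponding value from the list above (a sketch, not a recommendation)::

    modprobe asus-laptop wapf=0x4

With the driver built into the kernel, the equivalent boot parameter is
``asus-laptop.wapf=0x4``.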
Unsupported models
------------------
These models will never be supported by this module, as they use a completely
different mechanism to handle LEDs and extra stuff (meaning we have no clue
how it works):
- ASUS A1300 (A1B), A1370D
- ASUS L7300G
- ASUS L8400
Patches, Errors, Questions
--------------------------
I appreciate any success or failure
reports, especially if they add to or correct the compatibility table.
Please include the following information in your report:
- Asus model name
- a copy of your ACPI tables, using the "acpidump" utility
- a copy of /sys/devices/platform/asus-laptop/infos
- which driver features work and which don't
- the observed behavior of non-working features
Any other comments or patches are also more than welcome.
acpi4asus-user@lists.sourceforge.net
http://sourceforge.net/projects/acpi4asus


@@ -0,0 +1,151 @@
==========================
Hard disk shock protection
==========================
Author: Elias Oltmanns <eo@nebensachen.de>
Last modified: 2008-10-03
.. 0. Contents
1. Intro
2. The interface
3. References
4. CREDITS
1. Intro
--------
ATA/ATAPI-7 specifies the IDLE IMMEDIATE command with unload feature.
Issuing this command should cause the drive to switch to idle mode and
unload disk heads. This feature is being used in modern laptops in
conjunction with accelerometers and appropriate software to implement
a shock protection facility. The idea is to stop all I/O operations on
the internal hard drive and park its heads on the ramp when critical
situations are anticipated. The desire to have such a feature
available on GNU/Linux systems has been the original motivation to
implement a generic disk head parking interface in the Linux kernel.
Please note, however, that other components have to be set up on your
system in order to get disk shock protection working (see
section 3. References below for pointers to more information about
that).
2. The interface
----------------
For each ATA device, the kernel exports the file
`block/*/device/unload_heads` in sysfs (here assumed to be mounted under
/sys). Access to `/sys/block/*/device/unload_heads` is denied with
-EOPNOTSUPP if the device does not support the unload feature.
Otherwise, writing an integer value to this file will take the heads
of the respective drive off the platter and block all I/O operations
for the specified number of milliseconds. When the timeout expires and
no further disk head park request has been issued in the meantime,
normal operation will be resumed. The maximal value accepted for a
timeout is 30000 milliseconds. Exceeding this limit will return
-EOVERFLOW, but heads will be parked anyway and the timeout will be
set to 30 seconds. However, you can always change a timeout to any
value between 0 and 30000 by issuing a subsequent head park request
before the timeout of the previous one has expired. In particular, the
total timeout can exceed 30 seconds and, more importantly, you can
cancel a previously set timeout and resume normal operation
immediately by specifying a timeout of 0. Values below -2 are rejected
with -EINVAL (see below for the special meaning of -1 and -2). If the
timeout specified for a recent head park request has not yet expired,
reading from `/sys/block/*/device/unload_heads` will report the number
of milliseconds remaining until normal operation will be resumed;
otherwise, reading the unload_heads attribute will return 0.
For example, do the following in order to park the heads of drive
/dev/sda and stop all I/O operations for five seconds::
# echo 5000 > /sys/block/sda/device/unload_heads
A simple::
# cat /sys/block/sda/device/unload_heads
will show you how many milliseconds are left before normal operation
will be resumed.
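Because a head park request issued before the previous timeout expires simply
replaces it, a user-space helper can keep the heads parked for as long as some
external condition holds. A minimal sketch (the flag file /tmp/fall_detected
is purely hypothetical; a real helper would query an accelerometer instead)::

    # renew the park request in short intervals while the condition holds
    while [ -e /tmp/fall_detected ]; do
        echo 2000 > /sys/block/sda/device/unload_heads
        sleep 1
    done
    # cancel any remaining timeout and resume normal operation immediately
    echo 0 > /sys/block/sda/device/unload_heads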
A word of caution: The fact that the interface operates on a basis of
milliseconds may raise expectations that cannot be satisfied in
reality. In fact, the ATA specs clearly state that the time for an
unload operation to complete is vendor specific. The hint in ATA-7
that this will typically be within 500 milliseconds apparently has
been dropped in ATA-8.
There is a technical detail of this implementation that may cause some
confusion and should be discussed here. When a head park request has
been issued to a device successfully, all I/O operations on the
controller port this device is attached to will be deferred. That is
to say, any other device that may be connected to the same port will
be affected too. The only exception is that a subsequent head unload
request to that other device will be executed immediately. Further
operations on that port will be deferred until the timeout specified
for either device on the port has expired. As far as PATA (old style
IDE) configurations are concerned, there can only be two devices
attached to any single port. In SATA world we have port multipliers
which means that a user-issued head parking request to one device may
actually result in stopping I/O to a whole bunch of devices. However,
since this feature is supposed to be used on laptops and does not seem
to be very useful in any other environment, there will be mostly one
device per port. Even if the CD/DVD writer happens to be connected to
the same port as the hard drive, it generally *should* recover just
fine from the occasional buffer under-run incurred by a head park
request to the HD. Actually, when you are using an ide driver rather
than its libata counterpart (i.e. your disk is called /dev/hda
instead of /dev/sda), then parking the heads of one drive (drive X)
will generally not affect the mode of operation of another drive
(drive Y) on the same port as described above. It is only when a port
reset is required to recover from an exception on drive Y that further
I/O operations on that drive (and the reset itself) will be delayed
until drive X is no longer in the parked state.
Finally, there are some hard drives that only comply with an earlier
version of the ATA standard than ATA-7, but do support the unload
feature nonetheless. Unfortunately, there is no safe way Linux can
detect these devices, so you won't be able to write to the
unload_heads attribute. If you know that your device really does
support the unload feature (for instance, because the vendor of your
laptop or the hard drive itself told you so), then you can tell the
kernel to enable the usage of this feature for that drive by writing
the special value -1 to the unload_heads attribute::
# echo -1 > /sys/block/sda/device/unload_heads
will enable the feature for /dev/sda, and giving -2 instead of -1 will
disable it again.
3. References
-------------
There are several laptops from different vendors featuring shock
protection capabilities. As manufacturers have refused to support open
source development of the required software components so far, Linux
support for shock protection varies considerably between different
hardware implementations. Ideally, this section should contain a list
of pointers at different projects aiming at an implementation of shock
protection on different systems. Unfortunately, I only know of a
single project which, although still considered experimental, is fit
for use. Please feel free to add projects that have been the victims
of my ignorance.
- http://www.thinkwiki.org/wiki/HDAPS
See this page for information about Linux support of the hard disk
active protection system as implemented in IBM/Lenovo Thinkpads.
4. CREDITS
----------
This implementation of disk head parking has been inspired by a patch
originally published by Jon Escombe <lists@dresco.co.uk>. My efforts
to develop an implementation of this feature that is fit to be merged
into mainline have been aided by various kernel developers, in
particular by Tejun Heo and Bartlomiej Zolnierkiewicz.


@@ -0,0 +1,17 @@
.. SPDX-License-Identifier: GPL-2.0
==============
Laptop Drivers
==============
.. toctree::
:maxdepth: 1
asus-laptop
disk-shock-protection
laptop-mode
lg-laptop
sony-laptop
sonypi
thinkpad-acpi
toshiba_haps


@@ -0,0 +1,781 @@
===============================================
How to conserve battery power using laptop-mode
===============================================
Document Author: Bart Samwel (bart@samwel.tk)
Date created: January 2, 2004
Last modified: December 06, 2004
Introduction
------------
Laptop mode is used to minimize the time that the hard disk needs to be spun up,
to conserve battery power on laptops. It has been reported to cause significant
power savings.
.. Contents
* Introduction
* Installation
* Caveats
* The Details
* Tips & Tricks
* Control script
* ACPI integration
* Monitoring tool
Installation
------------
To use laptop mode, you don't need to set any kernel configuration options
or anything. Simply install all the files included in this document, and
laptop mode will automatically be started when you're on battery. For
your convenience, a tarball containing an installer can be downloaded at:
http://www.samwel.tk/laptop_mode/laptop_mode/
To configure laptop mode, you need to edit the configuration file, which is
located in /etc/default/laptop-mode on Debian-based systems, or in
/etc/sysconfig/laptop-mode on other systems.
Unfortunately, automatic enabling of laptop mode does not work for
laptops that don't have ACPI. On those laptops, you need to start laptop
mode manually. To start laptop mode, run "laptop_mode start", and to
stop it, run "laptop_mode stop". (Note: the laptop mode tools package now
has experimental support for APM; you might want to try that first.)
Caveats
-------
* The downside of laptop mode is that you have a chance of losing up to 10
minutes of work. If you cannot afford this, don't use it! The supplied ACPI
scripts automatically turn off laptop mode when the battery almost runs out,
so that you won't lose any data at the end of your battery life.
* Most desktop hard drives have a very limited lifetime measured in spindown
cycles, typically about 50,000 (it's usually listed on the spec sheet).
Check your drive's rating, and don't wear down your drive's lifetime if you
don't need to.
* If you mount some of your ext3/reiserfs filesystems with the -n option, then
the control script will not be able to remount them correctly. You must set
DO_REMOUNTS=0 in the control script, otherwise it will remount them with the
wrong options -- or it will fail because it cannot write to /etc/mtab.
* If you have your filesystems listed as type "auto" in fstab, like I did, then
the control script will not recognize them as filesystems that need remounting.
You must list the filesystems with their true type instead.
* It has been reported that some versions of the mutt mail client use file access
times to determine whether a folder contains new mail. If you use mutt and
experience this, you must disable the noatime remounting by setting the option
DO_REMOUNT_NOATIME to 0 in the configuration file.
The Details
-----------
Laptop mode is controlled by the knob /proc/sys/vm/laptop_mode. This knob is
present for all kernels that have the laptop mode patch, regardless of any
configuration options. When the knob is set, any physical disk I/O (that might
have caused the hard disk to spin up) causes Linux to flush all dirty blocks. The
result of this is that after a disk has spun down, it will not be spun up
anymore to write dirty blocks, because those blocks had already been written
immediately after the most recent read operation. The value of the laptop_mode
knob determines the time between the occurrence of disk I/O and when the flush
is triggered. A sensible value for the knob is 5 seconds. Setting the knob to
0 disables laptop mode.
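For example, to enable laptop mode with the suggested 5-second window and
check the setting::

    # echo 5 > /proc/sys/vm/laptop_mode
    # cat /proc/sys/vm/laptop_mode
    5

Writing 0 to the knob disables laptop mode again.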
To increase the effectiveness of the laptop_mode strategy, the laptop_mode
control script increases dirty_expire_centisecs and dirty_writeback_centisecs in
/proc/sys/vm to about 10 minutes (by default), which means that pages that are
dirtied are not forced to be written to disk as often. The control script also
changes the dirty background ratio, so that background writeback of dirty pages
is not done anymore. Combined with a higher commit value (also 10 minutes) for
ext3 or ReiserFS filesystems (also done automatically by the control script),
this results in concentration of disk activity in a small time interval which
occurs only once every 10 minutes, or whenever the disk is forced to spin up by
a cache miss. The disk can then be spun down in the periods of inactivity.
If you want to find out which process caused the disk to spin up, you can
gather information by setting the flag /proc/sys/vm/block_dump. When this flag
is set, Linux reports all disk read and write operations that take place, and
all block dirtyings done to files. This makes it possible to debug why a disk
needs to spin up, and to increase battery life even more. The output of
block_dump is written to the kernel output, and it can be retrieved using
"dmesg". When you use block_dump and your kernel logging level also includes
kernel debugging messages, you probably want to turn off klogd, otherwise
the output of block_dump will be logged, causing disk activity that is not
normally there.
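For example, as root (remember the note about klogd above)::

    # echo 1 > /proc/sys/vm/block_dump
    # dmesg | tail    # lists which process read, wrote or dirtied which blocks
    # echo 0 > /proc/sys/vm/block_dump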
Configuration
-------------
The laptop mode configuration file is located in /etc/default/laptop-mode on
Debian-based systems, or in /etc/sysconfig/laptop-mode on other systems. It
contains the following options:
MAX_AGE:
Maximum time, in seconds, of hard drive spindown time that you are
comfortable with. Worst case, it's possible that you could lose this
amount of work if your battery fails while you're in laptop mode.
MINIMUM_BATTERY_MINUTES:
Automatically disable laptop mode if the remaining number of minutes of
battery power is less than this value. Default is 10 minutes.
AC_HD/BATT_HD:
The idle timeout that should be set on your hard drive when laptop mode
is active (BATT_HD) and when it is not active (AC_HD). The defaults are
20 seconds (value 4) for BATT_HD and 2 hours (value 244) for AC_HD. The
possible values are those listed in the manual page for "hdparm" for the
"-S" option.
HD:
The devices for which the spindown timeout should be adjusted by laptop mode.
Default is /dev/hda. If you specify multiple devices, separate them by a space.
READAHEAD:
Disk readahead, in 512-byte sectors, while laptop mode is active. A large
readahead can prevent disk accesses for things like executable pages (which are
loaded on demand while the application executes) and sequentially accessed data
(MP3s).
DO_REMOUNTS:
The control script automatically remounts any mounted journaled filesystems
with appropriate commit interval options. When this option is set to 0, this
feature is disabled.
DO_REMOUNT_NOATIME:
When remounting, should the filesystems be remounted with the noatime option?
Normally, this is set to "1" (enabled), but there may be programs that require
access time recording.
DIRTY_RATIO:
The percentage of memory that is allowed to contain "dirty" or unsaved data
before a writeback is forced, while laptop mode is active. Corresponds to
the /proc/sys/vm/dirty_ratio sysctl.
DIRTY_BACKGROUND_RATIO:
The percentage of memory that is allowed to contain "dirty" or unsaved data
after a forced writeback is done due to an exceeding of DIRTY_RATIO. Set
this nice and low. This corresponds to the /proc/sys/vm/dirty_background_ratio
sysctl.
Note that the behaviour of dirty_background_ratio is quite different
when laptop mode is active and when it isn't. When laptop mode is inactive,
dirty_background_ratio is the threshold percentage at which background writeouts
start taking place. When laptop mode is active, however, background writeouts
are disabled, and the dirty_background_ratio only determines how much writeback
is done when dirty_ratio is reached.
DO_CPU:
Enable CPU frequency scaling when in laptop mode. (Requires CPUFreq to be set
up; see Documentation/admin-guide/pm/cpufreq.rst for more info. Disabled by default.)
CPU_MAXFREQ:
When on battery, what is the maximum CPU speed that the system should use? Legal
values are "slowest" for the slowest speed that your CPU is able to operate at,
or a value listed in /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies.
Tips & Tricks
-------------
* Bartek Kania reports getting up to 50 minutes of extra battery life (on top
of his regular 3 to 3.5 hours) using a spindown time of 5 seconds (BATT_HD=1).
* You can spin down the disk while playing MP3, by setting disk readahead
to 8MB (READAHEAD=16384). Effectively, the disk will read a complete MP3 at
once, and will then spin down while the MP3 is playing. (Thanks to Bartek
Kania.)
* Drew Scott Daniels observed: "I don't know why, but when I decrease the number
of colours that my display uses it consumes less battery power. I've seen
this on powerbooks too. I hope that this is a piece of information that
might be useful to the Laptop Mode patch or its users."
* In syslog.conf, you can prefix entries with a dash `-` to omit syncing the
file after every logging. When you're using laptop-mode and your disk doesn't
spin down, this is a likely culprit.
* Richard Atterer observed that laptop mode does not work well with noflushd
(http://noflushd.sourceforge.net/), it seems that noflushd prevents laptop-mode
from doing its thing.
* If you're worried about your data, you might want to consider using a USB
memory stick or something like that as a "working area". (Be aware though
that flash memory can only handle a limited number of writes, and overuse
may wear out your memory stick pretty quickly. Do _not_ use journalling
filesystems on flash memory sticks.)
Configuration file for control and ACPI battery scripts
-------------------------------------------------------
This allows the tunables for the scripts to be changed via an external
configuration file.
It should be installed as /etc/default/laptop-mode on Debian, and as
/etc/sysconfig/laptop-mode on Red Hat, SUSE, Mandrake, and other work-alikes.
Config file::
# Maximum time, in seconds, of hard drive spindown time that you are
# comfortable with. Worst case, it's possible that you could lose this
# amount of work if your battery fails you while in laptop mode.
#MAX_AGE=600
# Automatically disable laptop mode when the number of minutes of battery
# that you have left goes below this threshold.
MINIMUM_BATTERY_MINUTES=10
# Read-ahead, in 512-byte sectors. You can spin down the disk while playing MP3/OGG
# by setting the disk readahead to 8MB (READAHEAD=16384). Effectively, the disk
# will read a complete MP3 at once, and will then spin down while the MP3/OGG is
# playing.
#READAHEAD=4096
# Shall we remount journaled fs. with appropriate commit interval? (1=yes)
#DO_REMOUNTS=1
# And shall we add the "noatime" option to that as well? (1=yes)
#DO_REMOUNT_NOATIME=1
# Dirty synchronous ratio. At this percentage of dirty pages the process
# which
# calls write() does its own writeback
#DIRTY_RATIO=40
#
# Allowed dirty background ratio, in percent. Once DIRTY_RATIO has been
# exceeded, the kernel will wake flusher threads which will then reduce the
# amount of dirty memory to dirty_background_ratio. Set this nice and low,
# so once some writeout has commenced, we do a lot of it.
#
#DIRTY_BACKGROUND_RATIO=5
# kernel default dirty buffer age
#DEF_AGE=30
#DEF_UPDATE=5
#DEF_DIRTY_BACKGROUND_RATIO=10
#DEF_DIRTY_RATIO=40
#DEF_XFS_AGE_BUFFER=15
#DEF_XFS_SYNC_INTERVAL=30
#DEF_XFS_BUFD_INTERVAL=1
# This must be adjusted manually to the value of HZ in the running kernel
# on 2.4, until the XFS people change their 2.4 external interfaces to work in
# centisecs. This can be automated, but it's a work in progress that still
# needs some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for
# external interfaces, and that is currently always set to 100. So you don't
# need to change this on 2.6.
#XFS_HZ=100
# Should the maximum CPU frequency be adjusted down while on battery?
# Requires CPUFreq to be setup.
# See Documentation/admin-guide/pm/cpufreq.rst for more info
#DO_CPU=0
# When on battery what is the maximum CPU speed that the system should
# use? Legal values are "slowest" for the slowest speed that your
# CPU is able to operate at, or a value listed in:
# /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
# Only applicable if DO_CPU=1.
#CPU_MAXFREQ=slowest
# Idle timeout for your hard drive (man hdparm for valid values, -S option)
# Default is 2 hours on AC (AC_HD=244) and 20 seconds for battery (BATT_HD=4).
#AC_HD=244
#BATT_HD=4
# The drives for which to adjust the idle timeout. Separate them by a space,
# e.g. HD="/dev/hda /dev/hdb".
#HD="/dev/hda"
# Set the spindown timeout on a hard drive?
#DO_HD=1
Control script
--------------
Please note that this control script works for the Linux 2.4 and 2.6 series (thanks
to Kiko Piris).
Control script::
#!/bin/bash
# start or stop laptop_mode, best run by a power management daemon when
# ac gets connected/disconnected from a laptop
#
# install as /sbin/laptop_mode
#
# Contributors to this script: Kiko Piris
# Bart Samwel
# Micha Feigin
# Andrew Morton
# Herve Eychenne
# Dax Kelson
#
# Original Linux 2.4 version by: Jens Axboe
#############################################################################
# Source config
if [ -f /etc/default/laptop-mode ] ; then
# Debian
. /etc/default/laptop-mode
elif [ -f /etc/sysconfig/laptop-mode ] ; then
# Others
. /etc/sysconfig/laptop-mode
fi
# Don't raise an error if the config file is incomplete
# set defaults instead:
# Maximum time, in seconds, of hard drive spindown time that you are
# comfortable with. Worst case, it's possible that you could lose this
# amount of work if your battery fails you while in laptop mode.
MAX_AGE=${MAX_AGE:-'600'}
# Read-ahead, in kilobytes
READAHEAD=${READAHEAD:-'4096'}
# Shall we remount journaled fs. with appropriate commit interval? (1=yes)
DO_REMOUNTS=${DO_REMOUNTS:-'1'}
# And shall we add the "noatime" option to that as well? (1=yes)
DO_REMOUNT_NOATIME=${DO_REMOUNT_NOATIME:-'1'}
# Shall we adjust the idle timeout on a hard drive?
DO_HD=${DO_HD:-'1'}
# Adjust idle timeout on which hard drive?
HD="${HD:-'/dev/hda'}"
# spindown time for HD (hdparm -S values)
AC_HD=${AC_HD:-'244'}
BATT_HD=${BATT_HD:-'4'}
# Dirty synchronous ratio. At this percentage of dirty pages the process which
# calls write() does its own writeback
DIRTY_RATIO=${DIRTY_RATIO:-'40'}
# cpu frequency scaling
# See Documentation/admin-guide/pm/cpufreq.rst for more info
DO_CPU=${CPU_MANAGE:-'0'}
CPU_MAXFREQ=${CPU_MAXFREQ:-'slowest'}
#
# Allowed dirty background ratio, in percent. Once DIRTY_RATIO has been
# exceeded, the kernel will wake flusher threads which will then reduce the
# amount of dirty memory to dirty_background_ratio. Set this nice and low,
# so once some writeout has commenced, we do a lot of it.
#
DIRTY_BACKGROUND_RATIO=${DIRTY_BACKGROUND_RATIO:-'5'}
# kernel default dirty buffer age
DEF_AGE=${DEF_AGE:-'30'}
DEF_UPDATE=${DEF_UPDATE:-'5'}
DEF_DIRTY_BACKGROUND_RATIO=${DEF_DIRTY_BACKGROUND_RATIO:-'10'}
DEF_DIRTY_RATIO=${DEF_DIRTY_RATIO:-'40'}
DEF_XFS_AGE_BUFFER=${DEF_XFS_AGE_BUFFER:-'15'}
DEF_XFS_SYNC_INTERVAL=${DEF_XFS_SYNC_INTERVAL:-'30'}
DEF_XFS_BUFD_INTERVAL=${DEF_XFS_BUFD_INTERVAL:-'1'}
# This must be adjusted manually to the value of HZ in the running kernel
# on 2.4, until the XFS people change their 2.4 external interfaces to work in
# centisecs. This can be automated, but it's a work in progress that still needs
# some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for external
# interfaces, and that is currently always set to 100. So you don't need to
# change this on 2.6.
XFS_HZ=${XFS_HZ:-'100'}
#############################################################################
KLEVEL="$(uname -r |
{
IFS='.' read a b c
echo $a.$b
}
)"
case "$KLEVEL" in
"2.4"|"2.6")
;;
*)
echo "Unhandled kernel version: $KLEVEL ('uname -r' = '$(uname -r)')" >&2
exit 1
;;
esac
if [ ! -e /proc/sys/vm/laptop_mode ] ; then
echo "Kernel is not patched with laptop_mode patch." >&2
exit 1
fi
if [ ! -w /proc/sys/vm/laptop_mode ] ; then
echo "You do not have enough privileges to enable laptop_mode." >&2
exit 1
fi
# Remove an option (the first parameter) of the form option=<number> from
# a mount options string (the rest of the parameters).
parse_mount_opts () {
OPT="$1"
shift
echo ",$*," | sed \
-e 's/,'"$OPT"'=[0-9]*,/,/g' \
-e 's/,,*/,/g' \
-e 's/^,//' \
-e 's/,$//'
}
# Remove an option (the first parameter) without any arguments from
# a mount option string (the rest of the parameters).
parse_nonumber_mount_opts () {
OPT="$1"
shift
echo ",$*," | sed \
-e 's/,'"$OPT"',/,/g' \
-e 's/,,*/,/g' \
-e 's/^,//' \
-e 's/,$//'
}
# Find out the state of a yes/no option (e.g. "atime"/"noatime") in
# fstab for a given filesystem, and use this state to replace the
# value of the option in another mount options string. The device
# is the first argument, the option name the second, and the default
# value the third. The remainder is the mount options string.
#
# Example:
# parse_yesno_opts_wfstab /dev/hda1 atime atime defaults,noatime
#
# If fstab contains, say, "rw" for this filesystem, then the result
# will be "defaults,atime".
parse_yesno_opts_wfstab () {
L_DEV="$1"
OPT="$2"
DEF_OPT="$3"
shift 3
L_OPTS="$*"
PARSEDOPTS1="$(parse_nonumber_mount_opts $OPT $L_OPTS)"
PARSEDOPTS1="$(parse_nonumber_mount_opts no$OPT $PARSEDOPTS1)"
# Watch for a default atime in fstab
FSTAB_OPTS="$(awk '$1 == "'$L_DEV'" { print $4 }' /etc/fstab)"
if echo "$FSTAB_OPTS" | grep "$OPT" > /dev/null ; then
# option specified in fstab: extract the value and use it
if echo "$FSTAB_OPTS" | grep "no$OPT" > /dev/null ; then
echo "$PARSEDOPTS1,no$OPT"
else
# no$OPT not found -- so we must have $OPT.
echo "$PARSEDOPTS1,$OPT"
fi
else
# option not specified in fstab -- choose the default.
echo "$PARSEDOPTS1,$DEF_OPT"
fi
}
# Find out the state of a numbered option (e.g. "commit=NNN") in
# fstab for a given filesystem, and use this state to replace the
# value of the option in another mount options string. The device
# is the first argument, and the option name the second. The
# remainder is the mount options string in which the replacement
# must be done.
#
# Example:
# parse_mount_opts_wfstab /dev/hda1 commit defaults,commit=7
#
# If fstab contains, say, "commit=3,rw" for this filesystem, then the
# result will be "rw,commit=3".
parse_mount_opts_wfstab () {
L_DEV="$1"
OPT="$2"
shift 2
L_OPTS="$*"
PARSEDOPTS1="$(parse_mount_opts $OPT $L_OPTS)"
# Watch for a default commit in fstab
FSTAB_OPTS="$(awk '$1 == "'$L_DEV'" { print $4 }' /etc/fstab)"
if echo "$FSTAB_OPTS" | grep "$OPT=" > /dev/null ; then
# option specified in fstab: extract the value, and use it
echo -n "$PARSEDOPTS1,$OPT="
echo ",$FSTAB_OPTS," | sed \
-e 's/.*,'"$OPT"'=//' \
-e 's/,.*//'
else
# option not specified in fstab: set it to 0
echo "$PARSEDOPTS1,$OPT=0"
fi
}
deduce_fstype () {
MP="$1"
# My root filesystem unfortunately has
# type "unknown" in /etc/mtab. If we encounter
# "unknown", we try to get the type from fstab.
cat /etc/fstab |
grep -v '^#' |
while read FSTAB_DEV FSTAB_MP FSTAB_FST FSTAB_OPTS FSTAB_DUMP FSTAB_PASS ; do
if [ "$FSTAB_MP" = "$MP" ]; then
echo $FSTAB_FST
exit 0
fi
done
}
if [ $DO_REMOUNT_NOATIME -eq 1 ] ; then
NOATIME_OPT=",noatime"
fi
case "$1" in
start)
AGE=$((100*$MAX_AGE))
XFS_AGE=$(($XFS_HZ*$MAX_AGE))
echo -n "Starting laptop_mode"
if [ -d /proc/sys/vm/pagebuf ] ; then
# (For 2.4 and early 2.6.)
# This only needs to be set, not reset -- it is only used when
# laptop mode is enabled.
echo $XFS_AGE > /proc/sys/vm/pagebuf/lm_flush_age
echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval
elif [ -f /proc/sys/fs/xfs/lm_age_buffer ] ; then
# (A couple of early 2.6 laptop mode patches had these.)
# The same goes for these.
echo $XFS_AGE > /proc/sys/fs/xfs/lm_age_buffer
echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval
elif [ -f /proc/sys/fs/xfs/age_buffer ] ; then
# (2.6.6)
# But not for these -- they are also used in normal
# operation.
echo $XFS_AGE > /proc/sys/fs/xfs/age_buffer
echo $XFS_AGE > /proc/sys/fs/xfs/sync_interval
elif [ -f /proc/sys/fs/xfs/age_buffer_centisecs ] ; then
# (2.6.7 upwards)
# And not for these either. These are in centisecs,
# not USER_HZ, so we have to use $AGE, not $XFS_AGE.
echo $AGE > /proc/sys/fs/xfs/age_buffer_centisecs
echo $AGE > /proc/sys/fs/xfs/xfssyncd_centisecs
echo 3000 > /proc/sys/fs/xfs/xfsbufd_centisecs
fi
case "$KLEVEL" in
"2.4")
echo 1 > /proc/sys/vm/laptop_mode
echo "30 500 0 0 $AGE $AGE 60 20 0" > /proc/sys/vm/bdflush
;;
"2.6")
echo 5 > /proc/sys/vm/laptop_mode
echo "$AGE" > /proc/sys/vm/dirty_writeback_centisecs
echo "$AGE" > /proc/sys/vm/dirty_expire_centisecs
echo "$DIRTY_RATIO" > /proc/sys/vm/dirty_ratio
echo "$DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio
;;
esac
if [ $DO_REMOUNTS -eq 1 ]; then
cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do
PARSEDOPTS="$(parse_mount_opts "$OPTS")"
if [ "$FST" = 'unknown' ]; then
FST=$(deduce_fstype $MP)
fi
case "$FST" in
"ext3"|"reiserfs")
PARSEDOPTS="$(parse_mount_opts commit "$OPTS")"
mount $DEV -t $FST $MP -o remount,$PARSEDOPTS,commit=$MAX_AGE$NOATIME_OPT
;;
"xfs")
mount $DEV -t $FST $MP -o remount,$OPTS$NOATIME_OPT
;;
esac
if [ -b $DEV ] ; then
blockdev --setra $(($READAHEAD * 2)) $DEV
fi
done
fi
if [ $DO_HD -eq 1 ] ; then
for THISHD in $HD ; do
/sbin/hdparm -S $BATT_HD $THISHD > /dev/null 2>&1
/sbin/hdparm -B 1 $THISHD > /dev/null 2>&1
done
fi
if [ $DO_CPU -eq 1 -a -e /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq ]; then
if [ $CPU_MAXFREQ = 'slowest' ]; then
CPU_MAXFREQ=`cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq`
fi
echo $CPU_MAXFREQ > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
fi
echo "."
;;
stop)
U_AGE=$((100*$DEF_UPDATE))
B_AGE=$((100*$DEF_AGE))
echo -n "Stopping laptop_mode"
echo 0 > /proc/sys/vm/laptop_mode
if [ -f /proc/sys/fs/xfs/age_buffer -a ! -f /proc/sys/fs/xfs/lm_age_buffer ] ; then
# These need to be restored, if there are no lm_*.
echo $(($XFS_HZ*$DEF_XFS_AGE_BUFFER)) > /proc/sys/fs/xfs/age_buffer
echo $(($XFS_HZ*$DEF_XFS_SYNC_INTERVAL)) > /proc/sys/fs/xfs/sync_interval
elif [ -f /proc/sys/fs/xfs/age_buffer_centisecs ] ; then
# These need to be restored as well.
echo $((100*$DEF_XFS_AGE_BUFFER)) > /proc/sys/fs/xfs/age_buffer_centisecs
echo $((100*$DEF_XFS_SYNC_INTERVAL)) > /proc/sys/fs/xfs/xfssyncd_centisecs
echo $((100*$DEF_XFS_BUFD_INTERVAL)) > /proc/sys/fs/xfs/xfsbufd_centisecs
fi
case "$KLEVEL" in
"2.4")
echo "30 500 0 0 $U_AGE $B_AGE 60 20 0" > /proc/sys/vm/bdflush
;;
"2.6")
echo "$U_AGE" > /proc/sys/vm/dirty_writeback_centisecs
echo "$B_AGE" > /proc/sys/vm/dirty_expire_centisecs
echo "$DEF_DIRTY_RATIO" > /proc/sys/vm/dirty_ratio
echo "$DEF_DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio
;;
esac
if [ $DO_REMOUNTS -eq 1 ] ; then
cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do
# Reset commit and atime options to defaults.
if [ "$FST" = 'unknown' ]; then
FST=$(deduce_fstype $MP)
fi
case "$FST" in
"ext3"|"reiserfs")
PARSEDOPTS="$(parse_mount_opts_wfstab $DEV commit $OPTS)"
PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $PARSEDOPTS)"
mount $DEV -t $FST $MP -o remount,$PARSEDOPTS
;;
"xfs")
PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $OPTS)"
mount $DEV -t $FST $MP -o remount,$PARSEDOPTS
;;
esac
if [ -b $DEV ] ; then
blockdev --setra 256 $DEV
fi
done
fi
if [ $DO_HD -eq 1 ] ; then
for THISHD in $HD ; do
/sbin/hdparm -S $AC_HD $THISHD > /dev/null 2>&1
/sbin/hdparm -B 255 $THISHD > /dev/null 2>&1
done
fi
if [ $DO_CPU -eq 1 -a -e /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq ]; then
echo `cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq` > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
fi
echo "."
;;
*)
echo "Usage: $0 {start|stop}" 2>&1
exit 1
;;
esac
exit 0
ACPI integration
----------------
Dax Kelson submitted this so that the ACPI acpid daemon will
kick off the laptop_mode script and run hdparm. The part that
automatically disables laptop mode when the battery is low was
written by Jan Topinski.
/etc/acpi/events/ac_adapter::
event=ac_adapter
action=/etc/acpi/actions/ac.sh %e
/etc/acpi/events/battery::
event=battery.*
action=/etc/acpi/actions/battery.sh %e
/etc/acpi/actions/ac.sh::
#!/bin/bash
# ac on/offline event handler
status=`awk '/^state: / { print $2 }' /proc/acpi/ac_adapter/$2/state`
case $status in
"on-line")
/sbin/laptop_mode stop
exit 0
;;
"off-line")
/sbin/laptop_mode start
exit 0
;;
esac
/etc/acpi/actions/battery.sh::
#! /bin/bash
# Automatically disable laptop mode when the battery almost runs out.
BATT_INFO=/proc/acpi/battery/$2/state
if [[ -f /proc/sys/vm/laptop_mode ]]
then
LM=`cat /proc/sys/vm/laptop_mode`
if [[ $LM -gt 0 ]]
then
if [[ -f $BATT_INFO ]]
then
# Source the config file only now that we know we need
if [ -f /etc/default/laptop-mode ] ; then
# Debian
. /etc/default/laptop-mode
elif [ -f /etc/sysconfig/laptop-mode ] ; then
# Others
. /etc/sysconfig/laptop-mode
fi
MINIMUM_BATTERY_MINUTES=${MINIMUM_BATTERY_MINUTES:-'10'}
ACTION="`cat $BATT_INFO | grep charging | cut -c 26-`"
if [[ $ACTION = "discharging" ]]
then
    PRESENT_RATE=`cat $BATT_INFO | grep "present rate:" | sed "s/.* \([0-9][0-9]* \).*/\1/" `
    REMAINING=`cat $BATT_INFO | grep "remaining capacity:" | sed "s/.* \([0-9][0-9]* \).*/\1/" `
    # Only test the threshold when the rate and capacity were actually read.
    if (($REMAINING * 60 / $PRESENT_RATE < $MINIMUM_BATTERY_MINUTES))
    then
        /sbin/laptop_mode stop
    fi
fi
else
logger -p daemon.warning "You are using laptop mode and your battery interface $BATT_INFO is missing. This may lead to loss of data when the battery runs out. Check kernel ACPI support and /proc/acpi/battery folder, and edit /etc/acpi/battery.sh to set BATT_INFO to the correct path."
fi
fi
fi
Monitoring tool
---------------
Bartek Kania submitted this; it can be used to measure how much time your disk
spends spun up/down. See tools/laptop/dslm/dslm.c.


@@ -0,0 +1,84 @@
.. SPDX-License-Identifier: GPL-2.0+
LG Gram laptop extra features
=============================
By Matan Ziv-Av <matan@svgalib.org>
Hotkeys
-------
The following FN keys are ignored by the kernel without this driver:
- FN-F1 (LG control panel) - Generates F15
- FN-F5 (Touchpad toggle) - Generates F13
- FN-F6 (Airplane mode) - Generates RFKILL
- FN-F8 (Keyboard backlight) - Generates F16.
This key also changes keyboard backlight mode.
- FN-F9 (Reader mode) - Generates F14
The rest of the FN keys work without a need for a special driver.
Reader mode
-----------
Writing 0/1 to /sys/devices/platform/lg-laptop/reader_mode disables/enables
reader mode. In this mode the screen colors change (blue color reduced),
and the reader mode indicator LED (on F9 key) turns on.
FN Lock
-------
Writing 0/1 to /sys/devices/platform/lg-laptop/fn_lock disables/enables
FN lock.
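For example::

    echo 1 > /sys/devices/platform/lg-laptop/reader_mode   # enable reader mode
    echo 0 > /sys/devices/platform/lg-laptop/fn_lock       # disable FN lock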
Battery care limit
------------------
Writing 80/100 to /sys/devices/platform/lg-laptop/battery_care_limit
sets the maximum capacity to charge the battery. Limiting the charge
reduces battery capacity loss over time.
This value is reset to 100 when the kernel boots.
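Since the limit is reset to 100 at boot, it must be reapplied on each boot,
for example from an init script::

    echo 80 > /sys/devices/platform/lg-laptop/battery_care_limit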
Fan mode
--------
Writing 1/0 to /sys/devices/platform/lg-laptop/fan_mode disables/enables
the fan silent mode.
USB charge
----------
Writing 0/1 to /sys/devices/platform/lg-laptop/usb_charge disables/enables
charging another device from the USB port while the device is turned off.
This value is reset to 0 when the kernel boots.
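For example, to allow charging another device from the USB port while the
laptop is powered off (reapply after each boot, since the value resets to 0)::

    echo 1 > /sys/devices/platform/lg-laptop/usb_charge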
LEDs
~~~~
There are two LED devices supported by the driver:
Keyboard backlight
------------------
An LED device named kbd_led controls the keyboard backlight. There are three
lighting levels: off (0), low (127) and high (255).
The keyboard backlight is also controlled by the key combination FN-F8
which cycles through those levels.
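If the driver registers kbd_led through the standard sysfs LED class (an
assumption; the exact path is not spelled out in this document), the level
can also be set directly::

    echo 127 > /sys/class/leds/kbd_led/brightness   # low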
Touchpad indicator LED
----------------------
On the F5 key. Controlled by the LED device named tpad_led.
