Merge branch 'dev-tools' into doc/4.9

Coalesce development-tool documents into a single directory and sphinxify
them.
Jonathan Corbet
2016-08-19 11:38:36 -06:00
15 changed files with 1560 additions and 1500 deletions


@@ -0,0 +1,491 @@
.. Copyright 2010 Nicolas Palix <npalix@diku.dk>
.. Copyright 2010 Julia Lawall <julia@diku.dk>
.. Copyright 2010 Gilles Muller <Gilles.Muller@lip6.fr>
.. highlight:: none
Coccinelle
==========
Coccinelle is a tool for pattern matching and text transformation that has
many uses in kernel development, including the application of complex,
tree-wide patches and detection of problematic programming patterns.
Getting Coccinelle
-------------------
The semantic patches included in the kernel use features and options
which are provided by Coccinelle version 1.0.0-rc11 and above.
Using earlier versions will fail as the option names used by
the Coccinelle files and coccicheck have been updated.
Coccinelle is available through the package manager
of many distributions, e.g.:
- Debian
- Fedora
- Ubuntu
- OpenSUSE
- Arch Linux
- NetBSD
- FreeBSD
You can get the latest version released from the Coccinelle homepage at
http://coccinelle.lip6.fr/
Information and tips about Coccinelle are also provided on the wiki
pages at http://cocci.ekstranet.diku.dk/wiki/doku.php
Once you have it, run the following command::
./configure
make
as a regular user, and install it with::
sudo make install
Supplemental documentation
---------------------------
For supplemental documentation refer to the wiki:
https://bottest.wiki.kernel.org/coccicheck
The wiki documentation always refers to the linux-next version of the script.
Using Coccinelle on the Linux kernel
------------------------------------
A Coccinelle-specific target is defined in the top level
Makefile. This target is named ``coccicheck`` and calls the ``coccicheck``
front-end in the ``scripts`` directory.
Four basic modes are defined: ``patch``, ``report``, ``context``, and
``org``. The mode to use is specified by setting the MODE variable with
``MODE=<mode>``.
- ``patch`` proposes a fix, when possible.
- ``report`` generates a list in the following format:
file:line:column-column: message
- ``context`` highlights lines of interest and their context in a
diff-like style. Lines of interest are indicated with ``-``.
- ``org`` generates a report in the Org mode format of Emacs.
Note that not all semantic patches implement all modes. For easy use
of Coccinelle, the default mode is "report".
Two other modes provide some common combinations of these modes.
- ``chain`` tries the previous modes in the order above until one succeeds.
- ``rep+ctxt`` runs the report mode and then the context mode.
It should be used with the C option (described later)
which checks the code on a file basis.
Examples
~~~~~~~~
To make a report for every semantic patch, run the following command::
make coccicheck MODE=report
To produce patches, run::
make coccicheck MODE=patch
The coccicheck target applies every semantic patch available in the
sub-directories of ``scripts/coccinelle`` to the entire Linux kernel.
For each semantic patch, a commit message is proposed. It gives a
description of the problem being checked by the semantic patch, and
includes a reference to Coccinelle.
As with any static code analyzer, Coccinelle produces false
positives. Thus, reports must be carefully checked, and patches
reviewed.
To enable verbose messages set the V= variable, for example::
make coccicheck MODE=report V=1
Coccinelle parallelization
---------------------------
By default, coccicheck runs with as much parallelism as possible. To change
the degree of parallelism, set the J= variable. For example, to run across 4 CPUs::
make coccicheck MODE=report J=4
As of Coccinelle 1.0.2, Coccinelle uses the OCaml parmap library for
parallelization; if support for it is detected, you will benefit from
parmap parallelization.
When parmap is enabled, coccicheck enables dynamic load balancing by using
the ``--chunksize 1`` argument. This ensures that threads are fed work one
item at a time, so that most of the work does not end up being done by only
a few threads. With dynamic load balancing, if a thread finishes early it
keeps being fed more work.
When parmap is enabled, if an error occurs in Coccinelle, this error
value is propagated back, and the return value of ``make coccicheck``
captures it.
Using Coccinelle with a single semantic patch
---------------------------------------------
The optional make variable COCCI can be used to check a single
semantic patch. In that case, the variable must be initialized with
the name of the semantic patch to apply.
For instance::
make coccicheck COCCI=<my_SP.cocci> MODE=patch
or::
make coccicheck COCCI=<my_SP.cocci> MODE=report
Controlling Which Files are Processed by Coccinelle
---------------------------------------------------
By default the entire kernel source tree is checked.
To apply Coccinelle to a specific directory, ``M=`` can be used.
For example, to check drivers/net/wireless/ one may write::
make coccicheck M=drivers/net/wireless/
To apply Coccinelle on a file basis, instead of a directory basis, the
following command may be used::
make C=1 CHECK="scripts/coccicheck"
To check only newly edited code, use the value 2 for the C flag, i.e.::
make C=2 CHECK="scripts/coccicheck"
In these modes, which work on a file basis, no information about the
semantic patches is displayed, and no commit message is proposed.
This runs every semantic patch in scripts/coccinelle by default. The
COCCI variable may additionally be used to only apply a single
semantic patch as shown in the previous section.
The "report" mode is the default. You can select another one with the
MODE variable explained above.
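These controls can be combined. As a sketch (the paths here are only
illustrative), checking a single semantic patch against one subsystem in
report mode might look like::
make coccicheck M=drivers/net/wireless/ COCCI=scripts/coccinelle/api/err_cast.cocci MODE=report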
Debugging Coccinelle SmPL patches
---------------------------------
Using coccicheck is best, as it provides the spatch command line with
include options matching those used when the kernel is compiled. You can
learn what these options are by using V=1; you can then run Coccinelle
manually with the debug options added.
Alternatively, you can debug Coccinelle runs against SmPL patches by
capturing stderr. By default stderr is redirected to /dev/null; if you'd
like to capture it, specify the ``DEBUG_FILE="file.txt"`` option to
coccicheck. For instance::
rm -f cocci.err
make coccicheck COCCI=scripts/coccinelle/free/kfree.cocci MODE=report DEBUG_FILE=cocci.err
cat cocci.err
You can use SPFLAGS to add debugging flags; for instance, you may want to
add both ``--profile`` and ``--show-trying`` to SPFLAGS when debugging::
rm -f err.log
export COCCI=scripts/coccinelle/misc/irqf_oneshot.cocci
make coccicheck DEBUG_FILE="err.log" MODE=report SPFLAGS="--profile --show-trying" M=./drivers/mfd/arizona-irq.c
err.log will now have the profiling information, while stdout will
provide some progress information as Coccinelle moves forward with
work.
DEBUG_FILE is only supported when using Coccinelle >= 1.2.
.cocciconfig support
--------------------
Coccinelle supports reading .cocciconfig for default Coccinelle options
that should be used every time spatch is spawned. The order of precedence
for .cocciconfig variables is as follows:
- Your current user's home directory is processed first
- Your directory from which spatch is called is processed next
- The directory provided with the --dir option is processed last, if used
Since coccicheck runs through make, it naturally runs from the kernel
proper dir; as such, the second rule above is what applies for picking up a
.cocciconfig when using ``make coccicheck``.
``make coccicheck`` also supports using M= targets. If you do not supply
any M= target, it is assumed you want to target the entire kernel.
The kernel coccicheck script has::
if [ "$KBUILD_EXTMOD" = "" ] ; then
OPTIONS="--dir $srctree $COCCIINCLUDE"
else
OPTIONS="--dir $KBUILD_EXTMOD $COCCIINCLUDE"
fi
KBUILD_EXTMOD is set when an explicit M= target is used. In both cases
the spatch --dir argument is used, so the third rule applies whether M=
is used or not; when M= is used, the target directory can have its own
.cocciconfig file. When M= is not passed as an argument to coccicheck, the
target directory is the same as the directory from which spatch was called.
If not using the kernel's coccicheck target, keep the above precedence
order logic of .cocciconfig reading. If using the kernel's coccicheck target,
override any of the kernel's .cocciconfig settings using SPFLAGS.
We help Coccinelle when used against Linux with a set of sensible default
options for Linux via our own Linux .cocciconfig. This hints to Coccinelle
that git can be used for ``git grep`` queries instead of coccigrep. A
timeout of 200 seconds should suffice for now.
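For illustration only, a .cocciconfig uses a git-config-style syntax; the
kernel's own file might look roughly like the following (the exact contents
can differ between kernel versions)::
[spatch]
options = --timeout 200
options = --use-gitgrep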
The options picked up by Coccinelle when reading a .cocciconfig do not
appear as arguments to the spatch processes running on your system. To
confirm what options will be used by Coccinelle, run::
spatch --print-options-only
You can override with your own preferred index option by using SPFLAGS.
Note that when options conflict, Coccinelle gives precedence to the last
options passed. It is possible to use idutils through .cocciconfig; however,
given the order of precedence followed by Coccinelle, and since the kernel
now carries its own .cocciconfig, you will need to use SPFLAGS to use
idutils if desired. See the "Additional flags" section below for more
details on how to use idutils.
Additional flags
----------------
Additional flags can be passed to spatch through the SPFLAGS
variable. This works as Coccinelle respects the last flags
given to it when options are in conflict. ::
make SPFLAGS=--use-glimpse coccicheck
Coccinelle supports idutils as well, but requires Coccinelle >= 1.0.6.
When no ID file is specified, Coccinelle assumes your ID database file
is .id-utils.index at the top level of the kernel; the kernel carries a
script, scripts/idutils_index.sh, which creates the database with::
mkid -i C --output .id-utils.index
If you have another database filename you can also just symlink with this
name. ::
make SPFLAGS=--use-idutils coccicheck
Alternatively you can specify the database filename explicitly, for
instance::
make SPFLAGS="--use-idutils /full-path/to/ID" coccicheck
See ``spatch --help`` to learn more about spatch options.
Note that the ``--use-glimpse`` and ``--use-idutils`` options
require external tools for indexing the code, so neither is
active by default. However, by indexing the code with
one of these tools, and depending on the cocci file used,
spatch can process the entire code base more quickly.
SmPL patch specific options
---------------------------
SmPL patches can have their own requirements for options passed
to Coccinelle. SmPL patch specific options can be provided by
providing them at the top of the SmPL patch, for instance::
// Options: --no-includes --include-headers
SmPL patch Coccinelle requirements
----------------------------------
As Coccinelle features get added, some more advanced SmPL patches
may require newer versions of Coccinelle. If an SmPL patch requires
a minimum version of Coccinelle, this can be specified as follows;
for example, to require at least Coccinelle 1.0.5::
// Requires: 1.0.5
Proposing new semantic patches
-------------------------------
New semantic patches can be proposed and submitted by kernel
developers. For the sake of clarity, they should be organized in the
sub-directories of ``scripts/coccinelle/``.
Detailed description of the ``report`` mode
-------------------------------------------
``report`` generates a list in the following format::
file:line:column-column: message
Example
~~~~~~~
Running::
make coccicheck MODE=report COCCI=scripts/coccinelle/api/err_cast.cocci
will execute the following part of the SmPL script::
<smpl>
@r depends on !context && !patch && (org || report)@
expression x;
position p;
@@
ERR_PTR@p(PTR_ERR(x))
@script:python depends on report@
p << r.p;
x << r.x;
@@
msg="ERR_CAST can be used with %s" % (x)
coccilib.report.print_report(p[0], msg)
</smpl>
This SmPL excerpt generates entries on the standard output, as
illustrated below::
/home/user/linux/crypto/ctr.c:188:9-16: ERR_CAST can be used with alg
/home/user/linux/crypto/authenc.c:619:9-16: ERR_CAST can be used with auth
/home/user/linux/crypto/xts.c:227:9-16: ERR_CAST can be used with alg
Detailed description of the ``patch`` mode
------------------------------------------
When the ``patch`` mode is available, it proposes a fix for each problem
identified.
Example
~~~~~~~
Running::
make coccicheck MODE=patch COCCI=scripts/coccinelle/api/err_cast.cocci
will execute the following part of the SmPL script::
<smpl>
@ depends on !context && patch && !org && !report @
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
</smpl>
This SmPL excerpt generates patch hunks on the standard output, as
illustrated below::
diff -u -p a/crypto/ctr.c b/crypto/ctr.c
--- a/crypto/ctr.c 2010-05-26 10:49:38.000000000 +0200
+++ b/crypto/ctr.c 2010-06-03 23:44:49.000000000 +0200
@@ -185,7 +185,7 @@ static struct crypto_instance *crypto_ct
alg = crypto_attr_alg(tb[1], CRYPTO_ALG_TYPE_CIPHER,
CRYPTO_ALG_TYPE_MASK);
if (IS_ERR(alg))
- return ERR_PTR(PTR_ERR(alg));
+ return ERR_CAST(alg);
/* Block size must be >= 4 bytes. */
err = -EINVAL;
Detailed description of the ``context`` mode
--------------------------------------------
``context`` highlights lines of interest and their context
in a diff-like style.
**NOTE**: The diff-like output generated is NOT an applicable patch. The
intent of the ``context`` mode is to highlight the important lines
(annotated with minus, ``-``) and to give some surrounding context
lines. This output can be used with the diff mode of
Emacs to review the code.
Example
~~~~~~~
Running::
make coccicheck MODE=context COCCI=scripts/coccinelle/api/err_cast.cocci
will execute the following part of the SmPL script::
<smpl>
@ depends on context && !patch && !org && !report@
expression x;
@@
* ERR_PTR(PTR_ERR(x))
</smpl>
This SmPL excerpt generates diff hunks on the standard output, as
illustrated below::
diff -u -p /home/user/linux/crypto/ctr.c /tmp/nothing
--- /home/user/linux/crypto/ctr.c 2010-05-26 10:49:38.000000000 +0200
+++ /tmp/nothing
@@ -185,7 +185,6 @@ static struct crypto_instance *crypto_ct
alg = crypto_attr_alg(tb[1], CRYPTO_ALG_TYPE_CIPHER,
CRYPTO_ALG_TYPE_MASK);
if (IS_ERR(alg))
- return ERR_PTR(PTR_ERR(alg));
/* Block size must be >= 4 bytes. */
err = -EINVAL;
Detailed description of the ``org`` mode
----------------------------------------
``org`` generates a report in the Org mode format of Emacs.
Example
~~~~~~~
Running::
make coccicheck MODE=org COCCI=scripts/coccinelle/api/err_cast.cocci
will execute the following part of the SmPL script::
<smpl>
@r depends on !context && !patch && (org || report)@
expression x;
position p;
@@
ERR_PTR@p(PTR_ERR(x))
@script:python depends on org@
p << r.p;
x << r.x;
@@
msg="ERR_CAST can be used with %s" % (x)
msg_safe=msg.replace("[","@(").replace("]",")")
coccilib.org.print_todo(p[0], msg_safe)
</smpl>
This SmPL excerpt generates Org entries on the standard output, as
illustrated below::
* TODO [[view:/home/user/linux/crypto/ctr.c::face=ovl-face1::linb=188::colb=9::cole=16][ERR_CAST can be used with alg]]
* TODO [[view:/home/user/linux/crypto/authenc.c::face=ovl-face1::linb=619::colb=9::cole=16][ERR_CAST can be used with auth]]
* TODO [[view:/home/user/linux/crypto/xts.c::face=ovl-face1::linb=227::colb=9::cole=16][ERR_CAST can be used with alg]]


@@ -0,0 +1,256 @@
Using gcov with the Linux kernel
================================
gcov profiling kernel support enables the use of GCC's coverage testing
tool gcov_ with the Linux kernel. Coverage data of a running kernel
is exported in gcov-compatible format via the "gcov" debugfs directory.
To get coverage data for a specific file, change to the kernel build
directory and use gcov with the ``-o`` option as follows (requires root)::
# cd /tmp/linux-out
# gcov -o /sys/kernel/debug/gcov/tmp/linux-out/kernel spinlock.c
This will create source code files annotated with execution counts
in the current directory. In addition, graphical gcov front-ends such
as lcov_ can be used to automate the process of collecting data
for the entire kernel and provide coverage overviews in HTML format.
Possible uses:
* debugging (has this line been reached at all?)
* test improvement (how do I change my test to cover these lines?)
* minimizing kernel configurations (do I need this option if the
associated code is never run?)
.. _gcov: http://gcc.gnu.org/onlinedocs/gcc/Gcov.html
.. _lcov: http://ltp.sourceforge.net/coverage/lcov.php
Preparation
-----------
Configure the kernel with::
CONFIG_DEBUG_FS=y
CONFIG_GCOV_KERNEL=y
select the gcc gcov format; the default is auto-detection based on the gcc version::
CONFIG_GCOV_FORMAT_AUTODETECT=y
and to get coverage data for the entire kernel::
CONFIG_GCOV_PROFILE_ALL=y
Note that kernels compiled with profiling flags will be significantly
larger and run slower. Also CONFIG_GCOV_PROFILE_ALL may not be supported
on all architectures.
Profiling data will only become accessible once debugfs has been
mounted::
mount -t debugfs none /sys/kernel/debug
Customization
-------------
To enable profiling for specific files or directories, add a line
similar to the following to the respective kernel Makefile:
- For a single file (e.g. main.o)::
GCOV_PROFILE_main.o := y
- For all files in one directory::
GCOV_PROFILE := y
To exclude files from being profiled even when CONFIG_GCOV_PROFILE_ALL
is specified, use::
GCOV_PROFILE_main.o := n
and::
GCOV_PROFILE := n
Only files which are linked to the main kernel image or are compiled as
kernel modules are supported by this mechanism.
Files
-----
The gcov kernel support creates the following files in debugfs:
``/sys/kernel/debug/gcov``
Parent directory for all gcov-related files.
``/sys/kernel/debug/gcov/reset``
Global reset file: resets all coverage data to zero when
written to.
``/sys/kernel/debug/gcov/path/to/compile/dir/file.gcda``
The actual gcov data file as understood by the gcov
tool. Resets file coverage data to zero when written to.
``/sys/kernel/debug/gcov/path/to/compile/dir/file.gcno``
Symbolic link to a static data file required by the gcov
tool. This file is generated by gcc when compiling with
option ``-ftest-coverage``.
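Putting the above entries together, a quick sketch of inspecting and
resetting per-file data, reusing the /tmp/linux-out and spinlock.c example
from above (actual paths depend on your build), might be::
# ls /sys/kernel/debug/gcov/tmp/linux-out/kernel/
# echo > /sys/kernel/debug/gcov/tmp/linux-out/kernel/spinlock.gcda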
Modules
-------
Kernel modules may contain cleanup code which is only run during
module unload time. The gcov mechanism provides a means to collect
coverage data for such code by keeping a copy of the data associated
with the unloaded module. This data remains available through debugfs.
Once the module is loaded again, the associated coverage counters are
initialized with the data from its previous instantiation.
This behavior can be deactivated by specifying the gcov_persist kernel
parameter::
gcov_persist=0
At run-time, a user can also choose to discard data for an unloaded
module by writing to its data file or the global reset file.
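For example (the module path below is only illustrative), discarding the
data of one module or of the whole kernel::
# echo > /sys/kernel/debug/gcov/path/to/module/file.gcda
# echo > /sys/kernel/debug/gcov/reset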
Separated build and test machines
---------------------------------
The gcov kernel profiling infrastructure is designed to work out of the
box for setups where kernels are built and run on the same machine. In
cases where the kernel runs on a separate machine, special preparations
must be made, depending on where the gcov tool is used:
a) gcov is run on the TEST machine
The gcov tool version on the test machine must be compatible with the
gcc version used for the kernel build. Also, the following files need to be
copied from the build to the test machine:
from the source tree:
- all C source files + headers
from the build tree:
- all C source files + headers
- all .gcda and .gcno files
- all links to directories
It is important to note that these files need to be placed into the
exact same file system location on the test machine as on the build
machine. If any of the path components is a symbolic link, the actual
directory needs to be used instead (due to make's CURDIR handling).
b) gcov is run on the BUILD machine
The following files need to be copied after each test case from the test
machine to the build machine:
from the gcov directory in debugfs:
- all .gcda files
- all links to .gcno files
These files can be copied to any location on the build machine. gcov
must then be called with the -o option pointing to that directory.
Example directory setup on the build machine::
/tmp/linux: kernel source tree
/tmp/out: kernel build directory as specified by make O=
/tmp/coverage: location of the files copied from the test machine
[user@build] cd /tmp/out
[user@build] gcov -o /tmp/coverage/tmp/out/init main.c
Troubleshooting
---------------
Problem
Compilation aborts during linker step.
Cause
Profiling flags are specified for source files which are not
linked to the main kernel or which are linked by a custom
linker procedure.
Solution
Exclude affected source files from profiling by specifying
``GCOV_PROFILE := n`` or ``GCOV_PROFILE_basename.o := n`` in the
corresponding Makefile.
Problem
Files copied from debugfs appear empty or incomplete.
Cause
Due to the way seq_file works, some tools such as cp or tar
may not correctly copy files from debugfs.
Solution
Use ``cat`` to read ``.gcda`` files and ``cp -d`` to copy links.
Alternatively use the mechanism shown in Appendix B.
Appendix A: gather_on_build.sh
------------------------------
Sample script to gather coverage meta files on the build machine
(see case a) above)::
#!/bin/bash
KSRC=$1
KOBJ=$2
DEST=$3
if [ -z "$KSRC" ] || [ -z "$KOBJ" ] || [ -z "$DEST" ]; then
echo "Usage: $0 <ksrc directory> <kobj directory> <output.tar.gz>" >&2
exit 1
fi
KSRC=$(cd $KSRC; printf "all:\n\t@echo \${CURDIR}\n" | make -f -)
KOBJ=$(cd $KOBJ; printf "all:\n\t@echo \${CURDIR}\n" | make -f -)
find $KSRC $KOBJ \( -name '*.gcno' -o -name '*.[ch]' -o -type l \) -a \
-perm /u+r,g+r | tar cfz $DEST -P -T -
if [ $? -eq 0 ] ; then
echo "$DEST successfully created, copy to test system and unpack with:"
echo " tar xfz $DEST -P"
else
echo "Could not create file $DEST"
fi
Appendix B: gather_on_test.sh
-----------------------------
Sample script to gather coverage data files on the test machine
(see case b) above)::
#!/bin/bash -e
DEST=$1
GCDA=/sys/kernel/debug/gcov
if [ -z "$DEST" ] ; then
echo "Usage: $0 <output.tar.gz>" >&2
exit 1
fi
TEMPDIR=$(mktemp -d)
echo Collecting data..
find $GCDA -type d -exec mkdir -p $TEMPDIR/\{\} \;
find $GCDA -name '*.gcda' -exec sh -c 'cat < $0 > '$TEMPDIR'/$0' {} \;
find $GCDA -name '*.gcno' -exec sh -c 'cp -d $0 '$TEMPDIR'/$0' {} \;
tar czf $DEST -C $TEMPDIR sys
rm -rf $TEMPDIR
echo "$DEST successfully created, copy to build system and unpack with:"
echo " tar xfz $DEST"


@@ -0,0 +1,173 @@
.. highlight:: none
Debugging kernel and modules via gdb
====================================
The kernel debugger kgdb, hypervisors like QEMU, or JTAG-based hardware
interfaces allow the Linux kernel and its modules to be debugged at run time
using gdb. Gdb comes with a powerful scripting interface for Python. The
kernel provides a collection of helper scripts that can simplify typical
kernel debugging steps. This is a short tutorial about how to enable and use
them. It focuses on QEMU/KVM virtual machines as target, but the examples can
be transferred to the other gdb stubs as well.
Requirements
------------
- gdb 7.2+ (recommended: 7.4+) with python support enabled (typically true
for distributions)
Setup
-----
- Create a virtual Linux machine for QEMU/KVM (see www.linux-kvm.org and
www.qemu.org for more details). For cross-development,
http://landley.net/aboriginal/bin keeps a pool of machine images and
toolchains that can be helpful to start from.
- Build the kernel with CONFIG_GDB_SCRIPTS enabled, but leave
CONFIG_DEBUG_INFO_REDUCED off. If your architecture supports
CONFIG_FRAME_POINTER, keep it enabled.
- Install that kernel on the guest.
Alternatively, QEMU allows booting the kernel directly using the -kernel,
-append, -initrd command line switches. This is generally only useful if
you do not depend on modules. See QEMU documentation for more details on
this mode.
- Enable the gdb stub of QEMU/KVM, either
- at VM startup time by appending "-s" to the QEMU command line
or
- during runtime by issuing "gdbserver" from the QEMU monitor
console
- cd /path/to/linux-build
- Start gdb: gdb vmlinux
Note: Some distros may restrict auto-loading of gdb scripts to known safe
directories. In case gdb reports that it refuses to load vmlinux-gdb.py, add::
add-auto-load-safe-path /path/to/linux-build
to ~/.gdbinit. See gdb help for more details.
- Attach to the booted guest::
(gdb) target remote :1234
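Putting the QEMU and gdb steps together, a minimal sketch might look like
the following (the disk image name, memory size and background start are
assumptions; QEMU's -s switch makes the gdb stub listen on the default
port 1234)::
qemu-system-x86_64 -hda rootfs.img -m 1G -s &
gdb vmlinux
(gdb) target remote :1234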
Examples of using the Linux-provided gdb helpers
------------------------------------------------
- Load module (and main kernel) symbols::
(gdb) lx-symbols
loading vmlinux
scanning for modules in /home/user/linux/build
loading @0xffffffffa0020000: /home/user/linux/build/net/netfilter/xt_tcpudp.ko
loading @0xffffffffa0016000: /home/user/linux/build/net/netfilter/xt_pkttype.ko
loading @0xffffffffa0002000: /home/user/linux/build/net/netfilter/xt_limit.ko
loading @0xffffffffa00ca000: /home/user/linux/build/net/packet/af_packet.ko
loading @0xffffffffa003c000: /home/user/linux/build/fs/fuse/fuse.ko
...
loading @0xffffffffa0000000: /home/user/linux/build/drivers/ata/ata_generic.ko
- Set a breakpoint on some not yet loaded module function, e.g.::
(gdb) b btrfs_init_sysfs
Function "btrfs_init_sysfs" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (btrfs_init_sysfs) pending.
- Continue the target::
(gdb) c
- Load the module on the target and watch the symbols being loaded as well as
the breakpoint hit::
loading @0xffffffffa0034000: /home/user/linux/build/lib/libcrc32c.ko
loading @0xffffffffa0050000: /home/user/linux/build/lib/lzo/lzo_compress.ko
loading @0xffffffffa006e000: /home/user/linux/build/lib/zlib_deflate/zlib_deflate.ko
loading @0xffffffffa01b1000: /home/user/linux/build/fs/btrfs/btrfs.ko
Breakpoint 1, btrfs_init_sysfs () at /home/user/linux/fs/btrfs/sysfs.c:36
36 btrfs_kset = kset_create_and_add("btrfs", NULL, fs_kobj);
- Dump the log buffer of the target kernel::
(gdb) lx-dmesg
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 3.8.0-rc4-dbg+ (...
[ 0.000000] Command line: root=/dev/sda2 resume=/dev/sda1 vga=0x314
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
....
- Examine fields of the current task struct::
(gdb) p $lx_current().pid
$1 = 4998
(gdb) p $lx_current().comm
$2 = "modprobe\000\000\000\000\000\000\000"
- Make use of the per-cpu function for the current or a specified CPU::
(gdb) p $lx_per_cpu("runqueues").nr_running
$3 = 1
(gdb) p $lx_per_cpu("runqueues", 2).nr_running
$4 = 0
- Dig into hrtimers using the container_of helper::
(gdb) set $next = $lx_per_cpu("hrtimer_bases").clock_base[0].active.next
(gdb) p *$container_of($next, "struct hrtimer", "node")
$5 = {
node = {
node = {
__rb_parent_color = 18446612133355256072,
rb_right = 0x0 <irq_stack_union>,
rb_left = 0x0 <irq_stack_union>
},
expires = {
tv64 = 1835268000000
}
},
_softexpires = {
tv64 = 1835268000000
},
function = 0xffffffff81078232 <tick_sched_timer>,
base = 0xffff88003fd0d6f0,
state = 1,
start_pid = 0,
start_site = 0xffffffff81055c1f <hrtimer_start_range_ns+20>,
start_comm = "swapper/2\000\000\000\000\000\000"
}
List of commands and functions
------------------------------
The number of commands and convenience functions may evolve over time;
this is just a snapshot of the initial version::
(gdb) apropos lx
function lx_current -- Return current task
function lx_module -- Find module by name and return the module variable
function lx_per_cpu -- Return per-cpu variable
function lx_task_by_pid -- Find Linux task by PID and return the task_struct variable
function lx_thread_info -- Calculate Linux thread_info from task variable
lx-dmesg -- Print Linux kernel log buffer
lx-lsmod -- List currently loaded modules
lx-symbols -- (Re-)load symbols of Linux kernel and currently loaded modules
Detailed help can be obtained via "help <command-name>" for commands and "help
function <function-name>" for convenience functions.


@@ -0,0 +1,173 @@
The Kernel Address Sanitizer (KASAN)
====================================
Overview
--------
KernelAddressSANitizer (KASAN) is a dynamic memory error detector. It provides
a fast and comprehensive solution for finding use-after-free and out-of-bounds
bugs.
KASAN uses compile-time instrumentation to check every memory access;
therefore you will need GCC version 4.9.2 or later. GCC 5.0 or later is
required for detection of out-of-bounds accesses to stack or global variables.
Currently KASAN is supported only for the x86_64 and arm64 architectures.
Usage
-----
To enable KASAN configure kernel with::
CONFIG_KASAN = y
and choose between CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE. Outline and
inline are compiler instrumentation types. The former produces a smaller binary,
while the latter is 1.1 - 2 times faster. Inline instrumentation requires a GCC
version 5.0 or later.
KASAN works with both SLUB and SLAB memory allocators.
For better bug detection and nicer reporting, enable CONFIG_STACKTRACE.
To disable instrumentation for specific files or directories, add a line
similar to the following to the respective kernel Makefile:
- For a single file (e.g. main.o)::
KASAN_SANITIZE_main.o := n
- For all files in one directory::
KASAN_SANITIZE := n
Error reports
~~~~~~~~~~~~~
A typical out of bounds access report looks like this::
==================================================================
BUG: AddressSanitizer: out of bounds access in kmalloc_oob_right+0x65/0x75 [test_kasan] at addr ffff8800693bc5d3
Write of size 1 by task modprobe/1689
=============================================================================
BUG kmalloc-128 (Not tainted): kasan error
-----------------------------------------------------------------------------
Disabling lock debugging due to kernel taint
INFO: Allocated in kmalloc_oob_right+0x3d/0x75 [test_kasan] age=0 cpu=0 pid=1689
__slab_alloc+0x4b4/0x4f0
kmem_cache_alloc_trace+0x10b/0x190
kmalloc_oob_right+0x3d/0x75 [test_kasan]
init_module+0x9/0x47 [test_kasan]
do_one_initcall+0x99/0x200
load_module+0x2cb3/0x3b20
SyS_finit_module+0x76/0x80
system_call_fastpath+0x12/0x17
INFO: Slab 0xffffea0001a4ef00 objects=17 used=7 fp=0xffff8800693bd728 flags=0x100000000004080
INFO: Object 0xffff8800693bc558 @offset=1368 fp=0xffff8800693bc720
Bytes b4 ffff8800693bc548: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
Object ffff8800693bc558: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff8800693bc568: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff8800693bc578: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff8800693bc588: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff8800693bc598: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff8800693bc5a8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff8800693bc5b8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object ffff8800693bc5c8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk.
Redzone ffff8800693bc5d8: cc cc cc cc cc cc cc cc ........
Padding ffff8800693bc718: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
CPU: 0 PID: 1689 Comm: modprobe Tainted: G B 3.18.0-rc1-mm1+ #98
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
ffff8800693bc000 0000000000000000 ffff8800693bc558 ffff88006923bb78
ffffffff81cc68ae 00000000000000f3 ffff88006d407600 ffff88006923bba8
ffffffff811fd848 ffff88006d407600 ffffea0001a4ef00 ffff8800693bc558
Call Trace:
[<ffffffff81cc68ae>] dump_stack+0x46/0x58
[<ffffffff811fd848>] print_trailer+0xf8/0x160
[<ffffffffa00026a7>] ? kmem_cache_oob+0xc3/0xc3 [test_kasan]
[<ffffffff811ff0f5>] object_err+0x35/0x40
[<ffffffffa0002065>] ? kmalloc_oob_right+0x65/0x75 [test_kasan]
[<ffffffff8120b9fa>] kasan_report_error+0x38a/0x3f0
[<ffffffff8120a79f>] ? kasan_poison_shadow+0x2f/0x40
[<ffffffff8120b344>] ? kasan_unpoison_shadow+0x14/0x40
[<ffffffff8120a79f>] ? kasan_poison_shadow+0x2f/0x40
[<ffffffffa00026a7>] ? kmem_cache_oob+0xc3/0xc3 [test_kasan]
[<ffffffff8120a995>] __asan_store1+0x75/0xb0
[<ffffffffa0002601>] ? kmem_cache_oob+0x1d/0xc3 [test_kasan]
[<ffffffffa0002065>] ? kmalloc_oob_right+0x65/0x75 [test_kasan]
[<ffffffffa0002065>] kmalloc_oob_right+0x65/0x75 [test_kasan]
[<ffffffffa00026b0>] init_module+0x9/0x47 [test_kasan]
[<ffffffff810002d9>] do_one_initcall+0x99/0x200
[<ffffffff811e4e5c>] ? __vunmap+0xec/0x160
[<ffffffff81114f63>] load_module+0x2cb3/0x3b20
[<ffffffff8110fd70>] ? m_show+0x240/0x240
[<ffffffff81115f06>] SyS_finit_module+0x76/0x80
[<ffffffff81cd3129>] system_call_fastpath+0x12/0x17
Memory state around the buggy address:
ffff8800693bc300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8800693bc380: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 fc
ffff8800693bc400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8800693bc480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8800693bc500: fc fc fc fc fc fc fc fc fc fc fc 00 00 00 00 00
>ffff8800693bc580: 00 00 00 00 00 00 00 00 00 00 03 fc fc fc fc fc
^
ffff8800693bc600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8800693bc680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8800693bc700: fc fc fc fc fb fb fb fb fb fb fb fb fb fb fb fb
ffff8800693bc780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8800693bc800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
The header of the report describes what kind of bug happened and what kind of
access caused it. It is followed by the description of the accessed slub object
(see the 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and
the description of the accessed memory page.
In the last section the report shows memory state around the accessed address.
Reading this part requires some understanding of how KASAN works.
The state of each aligned 8-byte region of memory is encoded in one shadow byte.
Those 8 bytes can be accessible, partially accessible, freed or part of a redzone.
We use the following encoding for each shadow byte: 0 means that all 8 bytes
of the corresponding memory region are accessible; a number N (1 <= N <= 7) means
that the first N bytes are accessible, and the other (8 - N) bytes are not;
any negative value indicates that the entire 8-byte word is inaccessible.
We use different negative values to distinguish between different kinds of
inaccessible memory like redzones or freed memory (see mm/kasan/kasan.h).
In the report above, the arrow points to the shadow byte 03, which means that
the accessed address is partially accessible.
Implementation details
----------------------
From a high level, our approach to memory error detection is similar to that
of kmemcheck: use shadow memory to record whether each byte of memory is safe
to access, and use compile-time instrumentation to check shadow memory on each
memory access.
AddressSanitizer dedicates 1/8 of kernel memory to its shadow memory
(e.g. 16TB to cover 128TB on x86_64) and uses direct mapping with a scale and
offset to translate a memory address to its corresponding shadow address.
Here is the function which translates an address to its corresponding shadow
address::
static inline void *kasan_mem_to_shadow(const void *addr)
{
return ((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
+ KASAN_SHADOW_OFFSET;
}
where ``KASAN_SHADOW_SCALE_SHIFT = 3``.
Compile-time instrumentation is used for checking memory accesses. The compiler
inserts function calls (__asan_load*(addr), __asan_store*(addr)) before each memory
access of size 1, 2, 4, 8 or 16. These functions check whether the memory access is
valid or not by checking the corresponding shadow memory.
GCC 5.0 has the ability to perform inline instrumentation. Instead of making
function calls, GCC directly inserts the code to check the shadow memory.
This option significantly enlarges the kernel, but it gives an x1.1-x2 performance
boost over an outline-instrumented kernel.


@@ -0,0 +1,111 @@
kcov: code coverage for fuzzing
===============================
kcov exposes kernel code coverage information in a form suitable for coverage-
guided fuzzing (randomized testing). Coverage data of a running kernel is
exported via the "kcov" debugfs file. Coverage collection is enabled on a task
basis, and thus it can capture precise coverage of a single system call.
Note that kcov does not aim to collect as much coverage as possible. It aims
to collect more or less stable coverage that is a function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard interrupts,
and instrumentation of some inherently non-deterministic parts of the kernel is
disabled (e.g. scheduler, locking).
Usage
-----
Configure the kernel with::
CONFIG_KCOV=y
CONFIG_KCOV requires gcc built on revision 231296 or later.
Profiling data will only become accessible once debugfs has been mounted::
mount -t debugfs none /sys/kernel/debug
The following program demonstrates kcov usage from within a test program::
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>
#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
#define KCOV_ENABLE _IO('c', 100)
#define KCOV_DISABLE _IO('c', 101)
#define COVER_SIZE (64<<10)
int main(int argc, char **argv)
{
int fd;
unsigned long *cover, n, i;
/* A single file descriptor allows coverage collection on a single
* thread.
*/
fd = open("/sys/kernel/debug/kcov", O_RDWR);
if (fd == -1)
perror("open"), exit(1);
/* Setup trace mode and trace size. */
if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE))
perror("ioctl"), exit(1);
/* Mmap buffer shared between kernel- and user-space. */
cover = (unsigned long*)mmap(NULL, COVER_SIZE * sizeof(unsigned long),
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if ((void*)cover == MAP_FAILED)
perror("mmap"), exit(1);
/* Enable coverage collection on the current thread. */
if (ioctl(fd, KCOV_ENABLE, 0))
perror("ioctl"), exit(1);
/* Reset coverage from the tail of the ioctl() call. */
__atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);
/* That's the target syscall. */
read(-1, NULL, 0);
/* Read number of PCs collected. */
n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
for (i = 0; i < n; i++)
printf("0x%lx\n", cover[i + 1]);
/* Disable coverage collection for the current thread. After this call
* coverage can be enabled for a different thread.
*/
if (ioctl(fd, KCOV_DISABLE, 0))
perror("ioctl"), exit(1);
/* Free resources. */
if (munmap(cover, COVER_SIZE * sizeof(unsigned long)))
perror("munmap"), exit(1);
if (close(fd))
perror("close"), exit(1);
return 0;
}
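One way to symbolize the collected PCs is to pipe the program's output
through addr2line against the matching vmlinux; a sketch, where the test
binary name and vmlinux path are assumptions (-f and -i print function
names and inlined frames)::
./kcov_test | addr2line -e vmlinux -f -i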
After piping the program's output through addr2line, it looks as follows::
SyS_read
fs/read_write.c:562
__fdget_pos
fs/file.c:774
__fget_light
fs/file.c:746
__fget_light
fs/file.c:750
__fget_light
fs/file.c:760
__fdget_pos
fs/file.c:784
SyS_read
fs/read_write.c:562
If a program needs to collect coverage from several threads (independently),
it needs to open /sys/kernel/debug/kcov in each thread separately.
The interface is fine-grained to allow efficient forking of test processes.
That is, a parent process opens /sys/kernel/debug/kcov, enables trace mode,
mmaps coverage buffer and then forks child processes in a loop. Child processes
only need to enable coverage (disable happens automatically on thread end).


@@ -0,0 +1,733 @@
Getting started with kmemcheck
==============================
Vegard Nossum <vegardno@ifi.uio.no>
Introduction
------------
kmemcheck is a debugging feature for the Linux Kernel. More specifically, it
is a dynamic checker that detects and warns about some uses of uninitialized
memory.
Userspace programmers might be familiar with Valgrind's memcheck. The main
difference between memcheck and kmemcheck is that memcheck works for userspace
programs only, and kmemcheck works for the kernel only. The implementations
are of course vastly different. Because of this, kmemcheck is not as accurate
as memcheck, but it turns out to be good enough in practice to discover real
programmer errors that the compiler is not able to find through static
analysis.
Enabling kmemcheck on a kernel will probably slow it down to the extent that
the machine will not be usable for normal workloads such as e.g. an
interactive desktop. kmemcheck will also cause the kernel to use about twice
as much memory as normal. For this reason, kmemcheck is strictly a debugging
feature.
Downloading
-----------
As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel.
Configuring and compiling
-------------------------
kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of
configuration variables must have specific settings in order for the kmemcheck
menu to even appear in "menuconfig". These are:
- ``CONFIG_CC_OPTIMIZE_FOR_SIZE=n``
This option is located under "General setup" / "Optimize for size".
Without this, gcc will use certain optimizations that usually lead to
false positive warnings from kmemcheck. An example of this is a 16-bit
field in a struct, where gcc may load 32 bits, then discard the upper
16 bits. kmemcheck sees only the 32-bit load, and may trigger a
warning for the upper 16 bits (if they're uninitialized).
- ``CONFIG_SLAB=y`` or ``CONFIG_SLUB=y``
This option is located under "General setup" / "Choose SLAB
allocator".
- ``CONFIG_FUNCTION_TRACER=n``
This option is located under "Kernel hacking" / "Tracers" / "Kernel
Function Tracer"
When function tracing is compiled in, gcc emits a call to another
function at the beginning of every function. This means that when the
page fault handler is called, the ftrace framework will be called
before kmemcheck has had a chance to handle the fault. If ftrace then
modifies memory that was tracked by kmemcheck, the result is an
endless recursive page fault.
- ``CONFIG_DEBUG_PAGEALLOC=n``
This option is located under "Kernel hacking" / "Memory Debugging"
/ "Debug page memory allocations".
In addition, I highly recommend turning on ``CONFIG_DEBUG_INFO=y``. This is also
located under "Kernel hacking". With this, you will be able to get line number
information from the kmemcheck warnings, which is extremely valuable in
debugging a problem. This option is not mandatory, however, because it slows
down the compilation process and produces a much bigger kernel image.
Now the kmemcheck menu should be visible (under "Kernel hacking" / "Memory
Debugging" / "kmemcheck: trap use of uninitialized memory"). Here follows
a description of the kmemcheck configuration variables:
- ``CONFIG_KMEMCHECK``
This must be enabled in order to use kmemcheck at all...
- ``CONFIG_KMEMCHECK_``[``DISABLED`` | ``ENABLED`` | ``ONESHOT``]``_BY_DEFAULT``
This option controls the status of kmemcheck at boot-time. "Enabled"
will enable kmemcheck right from the start, "disabled" will boot the
kernel as normal (but with the kmemcheck code compiled in, so it can
be enabled at run-time after the kernel has booted), and "one-shot" is
a special mode which will turn kmemcheck off automatically after
detecting the first use of uninitialized memory.
If you are using kmemcheck to actively debug a problem, then you
probably want to choose "enabled" here.
The one-shot mode is mostly useful in automated test setups because it
can prevent floods of warnings and increase the chances of the machine
surviving in case something is really wrong. In other cases, the one-
shot mode could actually be counter-productive because it would turn
itself off at the very first error -- in the case of a false positive
too -- and this would come in the way of debugging the specific
problem you were interested in.
If you would like to use your kernel as normal, but with a chance to
enable kmemcheck in case of some problem, it might be a good idea to
choose "disabled" here. When kmemcheck is disabled, most of the run-
time overhead is not incurred, and the kernel will be almost as fast
as normal.
- ``CONFIG_KMEMCHECK_QUEUE_SIZE``
Select the maximum number of error reports to store in an internal
(fixed-size) buffer. Since errors can occur virtually anywhere and in
any context, we need a temporary storage area which is guaranteed not
to generate any other page faults when accessed. The queue will be
emptied as soon as a tasklet may be scheduled. If the queue is full,
new error reports will be lost.
The default value of 64 is probably fine. If some code produces more
than 64 errors within an irqs-off section, then the code is likely to
produce many, many more, too, and these additional reports seldom give
any more information (the first report is usually the most valuable
anyway).
This number might have to be adjusted if you are not using serial
console or similar to capture the kernel log. If you are using the
"dmesg" command to save the log, then getting a lot of kmemcheck
warnings might overflow the kernel log itself, and the earlier reports
will get lost in that way instead. Try setting this to 10 or so on
such a setup.
- ``CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT``
Select the number of shadow bytes to save along with each entry of the
error-report queue. These bytes indicate what parts of an allocation
are initialized, uninitialized, etc. and will be displayed when an
error is detected to help the debugging of a particular problem.
The number entered here is actually the logarithm of the number of
bytes that will be saved. So if you pick for example 5 here, kmemcheck
will save 2^5 = 32 bytes.
The default value should be fine for debugging most problems. It also
fits nicely within 80 columns.
- ``CONFIG_KMEMCHECK_PARTIAL_OK``
This option (when enabled) works around certain GCC optimizations that
produce 32-bit reads from 16-bit variables where the upper 16 bits are
thrown away afterwards.
The default value (enabled) is recommended. This may of course hide
some real errors, but disabling it would probably produce a lot of
false positives.
- ``CONFIG_KMEMCHECK_BITOPS_OK``
This option silences warnings that would be generated for bit-field
accesses where not all the bits are initialized at the same time. This
may also hide some real bugs.
This option is probably obsolete, or it should be replaced with
the kmemcheck-/bitfield-annotations for the code in question. The
default value is therefore fine.
Now compile the kernel as usual.
How to use
----------
Booting
~~~~~~~
First some information about the command-line options. There is only one
option specific to kmemcheck, and this is called "kmemcheck". It can be used
to override the default mode as chosen by the ``CONFIG_KMEMCHECK_*_BY_DEFAULT``
option. Its possible settings are:
- ``kmemcheck=0`` (disabled)
- ``kmemcheck=1`` (enabled)
- ``kmemcheck=2`` (one-shot mode)
If SLUB debugging has been enabled in the kernel, it may take precedence over
kmemcheck in such a way that the slab caches which are under SLUB debugging
will not be tracked by kmemcheck. In order to ensure that this doesn't happen
(even though it shouldn't by default), use SLUB's boot option ``slub_debug``,
like this: ``slub_debug=-``
In fact, this option may also be used for fine-grained control over SLUB vs.
kmemcheck. For example, if the command line includes
``kmemcheck=1 slub_debug=,dentry``, then SLUB debugging will be used only
for the "dentry" slab cache, and with kmemcheck tracking all the other
caches. This is advanced usage, however, and is not generally recommended.
Run-time enable/disable
~~~~~~~~~~~~~~~~~~~~~~~
When the kernel has booted, it is possible to enable or disable kmemcheck at
run-time. WARNING: This feature is still experimental and may cause false
positive warnings to appear. Therefore, try not to use this. If you find that
it doesn't work properly (e.g. you see an unreasonable amount of warnings), I
will be happy to take bug reports.
Use the file ``/proc/sys/kernel/kmemcheck`` for this purpose, e.g.::
$ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck
The numbers are the same as for the ``kmemcheck=`` command-line option.
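For example, matching the ``kmemcheck=`` values listed above::
$ echo 1 > /proc/sys/kernel/kmemcheck   # enables kmemcheck
$ echo 2 > /proc/sys/kernel/kmemcheck   # one-shot mode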
Debugging
~~~~~~~~~
A typical report will look something like this::
WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024)
80000000000000000000000000000000000000000088ffff0000000000000000
i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
^
Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A
RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190
RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002
RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009
RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84
RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000
R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e
R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8
FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
[<ffffffff8104f04e>] dequeue_signal+0x8e/0x170
[<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390
[<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0
[<ffffffff8100c7b5>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
The single most valuable information in this report is the RIP (or EIP on 32-
bit) value. This will help us pinpoint exactly which instruction caused
the warning.
If your kernel was compiled with ``CONFIG_DEBUG_INFO=y``, then all we have to do
is give this address to the addr2line program, like this::
$ addr2line -e vmlinux -i ffffffff8104ede8
arch/x86/include/asm/string_64.h:12
include/asm-generic/siginfo.h:287
kernel/signal.c:380
kernel/signal.c:410
The "``-e vmlinux``" tells addr2line which file to look in. **IMPORTANT:**
This must be the vmlinux of the kernel that produced the warning in the
first place! If not, the line number information will almost certainly be
wrong.
The "``-i``" tells addr2line to also print the line numbers of inlined
functions. In this case, the flag was very important, because otherwise,
it would only have printed the first line, which is just a call to
``memcpy()``, which could be called from a thousand places in the kernel, and
is therefore not very useful. These inlined functions would not show up in
the stack trace above, simply because the kernel doesn't load the extra
debugging information. This technique can of course be used with ordinary
kernel oopses as well.
In this case, it's the caller of ``memcpy()`` that is interesting, and it can be
found in ``include/asm-generic/siginfo.h``, line 287::
281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from)
282 {
283 if (from->si_code < 0)
284 memcpy(to, from, sizeof(*to));
285 else
286 /* _sigchld is currently the largest know union member */
287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld));
288 }
Since this was a read (kmemcheck usually warns about reads only, though it can
warn about writes to unallocated or freed memory as well), it was probably the
"from" argument which contained some uninitialized bytes. Following the chain
of calls, we move upwards to see where "from" was allocated or initialized,
``kernel/signal.c``, line 380::
359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
360 {
...
367 list_for_each_entry(q, &list->list, list) {
368 if (q->info.si_signo == sig) {
369 if (first)
370 goto still_pending;
371 first = q;
...
377 if (first) {
378 still_pending:
379 list_del_init(&first->list);
380 copy_siginfo(info, &first->info);
381 __sigqueue_free(first);
...
392 }
393 }
Here, it is ``&first->info`` that is being passed on to ``copy_siginfo()``. The
variable ``first`` was found on a list -- passed in as the second argument to
``collect_signal()``. We continue our journey through the stack, to figure out
where the item on "list" was allocated or initialized. We move to line 410::
395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
396 siginfo_t *info)
397 {
...
410 collect_signal(sig, pending, info);
...
414 }
Now we need to follow the ``pending`` pointer, since that is being passed on to
``collect_signal()`` as ``list``. At this point, we've run out of lines from the
"addr2line" output. Not to worry, we just paste the next addresses from the
kmemcheck stack dump, i.e.::
[<ffffffff8104f04e>] dequeue_signal+0x8e/0x170
[<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390
[<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0
[<ffffffff8100c7b5>] int_signal+0x12/0x17
$ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \
ffffffff8100b87d ffffffff8100c7b5
kernel/signal.c:446
kernel/signal.c:1806
arch/x86/kernel/signal.c:805
arch/x86/kernel/signal.c:871
arch/x86/kernel/entry_64.S:694
Remember that since these addresses were found on the stack and not as the
RIP value, they actually point to the _next_ instruction (they are return
addresses). This becomes obvious when we look at the code for line 446::
422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
423 {
...
431 signr = __dequeue_signal(&tsk->signal->shared_pending,
432 mask, info);
433 /*
434 * itimer signal ?
435 *
436 * itimers are process shared and we restart periodic
437 * itimers in the signal delivery path to prevent DoS
438 * attacks in the high resolution timer case. This is
439 * compliant with the old way of self restarting
440 * itimers, as the SIGALRM is a legacy signal and only
441 * queued once. Changing the restart behaviour to
442 * restart the timer in the signal dequeue path is
443 * reducing the timer noise on heavy loaded !highres
444 * systems too.
445 */
446 if (unlikely(signr == SIGALRM)) {
...
489 }
So instead of looking at 446, we should be looking at 431, which is the line
that executes just before 446. Here we see that what we are looking for is
``&tsk->signal->shared_pending``.
Our next task is now to figure out which function puts items on this
``shared_pending`` list. A crude, but efficient tool, is ``git grep``::
$ git grep -n 'shared_pending' kernel/
...
kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending;
kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending;
...
There were more results, but none of them were related to list operations,
and these were the only assignments. We inspect the line numbers more closely
and find that this is indeed where items are being added to the list::
816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
817 int group)
818 {
...
828 pending = group ? &t->signal->shared_pending : &t->pending;
...
851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN &&
852 (is_si_special(info) ||
853 info->si_code >= 0)));
854 if (q) {
855 list_add_tail(&q->list, &pending->list);
...
890 }
and::
1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
1310 {
....
1339 pending = group ? &t->signal->shared_pending : &t->pending;
1340 list_add_tail(&q->list, &pending->list);
....
1347 }
In the first case, the list element we are looking for, ``q``, is being
returned from the function ``__sigqueue_alloc()``, which looks like an
allocation function. Let's take a look at it::
187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags,
188 int override_rlimit)
189 {
190 struct sigqueue *q = NULL;
191 struct user_struct *user;
192
193 /*
194 * We won't get problems with the target's UID changing under us
195 * because changing it requires RCU be used, and if t != current, the
196 * caller must be holding the RCU readlock (by way of a spinlock) and
197 * we use RCU protection here
198 */
199 user = get_uid(__task_cred(t)->user);
200 atomic_inc(&user->sigpending);
201 if (override_rlimit ||
202 atomic_read(&user->sigpending) <=
203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur)
204 q = kmem_cache_alloc(sigqueue_cachep, flags);
205 if (unlikely(q == NULL)) {
206 atomic_dec(&user->sigpending);
207 free_uid(user);
208 } else {
209 INIT_LIST_HEAD(&q->list);
210 q->flags = 0;
211 q->user = user;
212 }
213
214 return q;
215 }
We see that this function initializes ``q->list``, ``q->flags``, and
``q->user``. It seems that now is the time to look at the definition of
``struct sigqueue``, e.g.::
14 struct sigqueue {
15 struct list_head list;
16 int flags;
17 siginfo_t info;
18 struct user_struct *user;
19 };
And, you might remember, it was a ``memcpy()`` on ``&first->info`` that
caused the warning, so this makes perfect sense. It also seems reasonable
to assume that it is the caller of ``__sigqueue_alloc()`` that has the
responsibility of filling out (initializing) this member.
But just which fields of the struct were uninitialized? Let's look at
kmemcheck's report again::
WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024)
80000000000000000000000000000000000000000088ffff0000000000000000
i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
^
These first two lines are the memory dump of the memory object itself, and
the shadow bytemap, respectively. The memory object itself is in this case
``&first->info``. Just beware that the start of this dump is NOT the start
of the object itself! The position of the caret (^) corresponds with the
address of the read (ffff88003e4a2024).
The shadow bytemap dump legend is as follows:
- i: initialized
- u: uninitialized
- a: unallocated (memory has been allocated by the slab layer, but has not
yet been handed off to anybody)
- f: freed (memory has been allocated by the slab layer, but has been freed
by the previous owner)
In order to figure out where (relative to the start of the object) the
uninitialized memory was located, we have to look at the disassembly. For
that, we'll need the RIP address again::
RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190
$ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8:
ffffffff8104edc8: mov %r8,0x8(%r8)
ffffffff8104edcc: test %r10d,%r10d
ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168>
ffffffff8104edd5: mov %rax,%rdx
ffffffff8104edd8: mov $0xc,%ecx
ffffffff8104eddd: mov %r13,%rdi
ffffffff8104ede0: mov $0x30,%eax
ffffffff8104ede5: mov %rdx,%rsi
ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi)
ffffffff8104edea: test $0x2,%al
ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0>
ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi)
ffffffff8104edf0: test $0x1,%al
ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5>
ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi)
ffffffff8104edf5: mov %r8,%rdi
ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free>
As expected, it's the "``rep movsl``" instruction from the ``memcpy()``
that causes the warning. We know that ``REP MOVSL`` uses the register
``RCX`` to count the number of remaining iterations. By taking a look at the
register dump again (from the kmemcheck report), we can figure out how many
bytes were left to copy::
RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009
By looking at the disassembly, we also see that ``%ecx`` is being loaded
with the value ``$0xc`` just before (ffffffff8104edd8), so we are very
lucky. Keep in mind that this is the number of iterations, not bytes. And
since this is a "long" operation, we need to multiply by 4 to get the
number of bytes. So this means that the uninitialized value was encountered
at 4 * (0xc - 0x9) = 12 bytes from the start of the object.
We can now try to figure out which field of the "``struct siginfo``" that
was not initialized. This is the beginning of the struct::
40 typedef struct siginfo {
41 int si_signo;
42 int si_errno;
43 int si_code;
44
45 union {
..
92 } _sifields;
93 } siginfo_t;
On 64-bit, an int is 4 bytes long, so it must be the union member that has
not been initialized. We can verify this using gdb::
$ gdb vmlinux
...
(gdb) p &((struct siginfo *) 0)->_sifields
$1 = (union {...} *) 0x10
Actually, it seems that the union member is located at offset 0x10 -- which
means that gcc has inserted 4 bytes of padding between the members ``si_code``
and ``_sifields``. We can now get a fuller picture of the memory dump::
_----------------------------=> si_code
/ _--------------------=> (padding)
| / _------------=> _sifields(._kill._pid)
| | / _----=> _sifields(._kill._uid)
| | | /
-------|-------|-------|-------|
80000000000000000000000000000000000000000088ffff0000000000000000
i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
This allows us to realize another important fact: ``si_code`` contains the
value 0x80. Remember that x86 is little endian, so the first 4 bytes
"80000000" are really the number 0x00000080. With a bit of research, we
find that this is actually the constant ``SI_KERNEL`` defined in
``include/asm-generic/siginfo.h``::
144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */
This macro is used in exactly one place in the x86 kernel: In ``send_signal()``
in ``kernel/signal.c``::
816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
817 int group)
818 {
...
828 pending = group ? &t->signal->shared_pending : &t->pending;
...
851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN &&
852 (is_si_special(info) ||
853 info->si_code >= 0)));
854 if (q) {
855 list_add_tail(&q->list, &pending->list);
856 switch ((unsigned long) info) {
...
865 case (unsigned long) SEND_SIG_PRIV:
866 q->info.si_signo = sig;
867 q->info.si_errno = 0;
868 q->info.si_code = SI_KERNEL;
869 q->info.si_pid = 0;
870 q->info.si_uid = 0;
871 break;
...
890 }
Not only does this match with the ``.si_code`` member, it also matches the place
we found earlier when looking for where siginfo_t objects are enqueued on the
``shared_pending`` list.
So to sum up: It seems that it is the padding introduced by the compiler
between two struct fields that is uninitialized, and this gets reported when
we do a ``memcpy()`` on the struct. This means that we have identified a false
positive warning.
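To make the layout concrete, here is a small illustration (this is not
kernel code, just a sketch of the same kind of layout) of how the
compiler's alignment rules create such a hole::

	struct example {
		int	a;		/* offset 0  */
		int	b;		/* offset 4  */
		int	c;		/* offset 8  */
		/* 4 bytes of compiler padding (offsets 12-15) */
		union {
			void	*ptr;	/* needs 8-byte alignment on x86-64 */
		} fields;		/* offset 16 */
	};

A ``memcpy()`` of a ``struct example`` also copies the padding bytes at
offsets 12-15, even though no code ever assigns to them.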
Normally, kmemcheck will not report uninitialized accesses in ``memcpy()`` calls
when both the source and destination addresses are tracked. (Instead, we copy
the shadow bytemap as well). In this case, the destination address clearly
was not tracked. We can dig a little deeper into the stack trace from above::
arch/x86/kernel/signal.c:805
arch/x86/kernel/signal.c:871
arch/x86/kernel/entry_64.S:694
And we clearly see that the destination siginfo object is located on the
stack::
782 static void do_signal(struct pt_regs *regs)
783 {
784 struct k_sigaction ka;
785 siginfo_t info;
...
804 signr = get_signal_to_deliver(&info, &ka, regs, NULL);
...
854 }
And this ``&info`` is what eventually gets passed to ``copy_siginfo()`` as the
destination argument.
Now, even though we didn't find an actual error here, the example is still a
good one, because it shows how one would go about finding out what the report
was all about.
Annotating false positives
~~~~~~~~~~~~~~~~~~~~~~~~~~
There are a few different ways to make annotations in the source code that
will keep kmemcheck from checking and reporting certain allocations. Here
they are:
- ``__GFP_NOTRACK_FALSE_POSITIVE``
This flag can be passed to ``kmalloc()`` or ``kmem_cache_alloc()``
(therefore also to other functions that end up calling one of
these) to indicate that the allocation should not be tracked
because it would lead to a false positive report. This is a "big
hammer" way of silencing kmemcheck; after all, even if the false
positive pertains to a particular field in a struct, for example, we
will now lose the ability to find (real) errors in other parts of
the same struct.
Example::
/* No warnings will ever trigger on accessing any part of x */
x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE);
- ``kmemcheck_bitfield_begin(name)``/``kmemcheck_bitfield_end(name)`` and
``kmemcheck_annotate_bitfield(ptr, name)``
The first two of these three macros can be used inside struct
definitions to signal, respectively, the beginning and end of a
bitfield. Additionally, this will assign the bitfield a name, which
is given as an argument to the macros.
Having used these markers, one can later use
kmemcheck_annotate_bitfield() at the point of allocation, to indicate
which parts of the allocation are part of a bitfield.
Example::
struct foo {
int x;
kmemcheck_bitfield_begin(flags);
int flag_a:1;
int flag_b:1;
kmemcheck_bitfield_end(flags);
int y;
};
struct foo *x = kmalloc(sizeof *x, GFP_KERNEL);
/* No warnings will trigger on accessing the bitfield of x */
kmemcheck_annotate_bitfield(x, flags);
Note that ``kmemcheck_annotate_bitfield()`` can be used even before the
return value of ``kmalloc()`` is checked -- in other words, passing NULL
as the first argument is legal (and will do nothing).
Reporting errors
----------------
As we have seen, kmemcheck will produce false positive reports. Therefore, it
is not very wise to blindly post kmemcheck warnings to mailing lists and
maintainers. Instead, I encourage maintainers and developers to find errors
in their own code. If you get a warning, you can try to work around it, try
to figure out if it's a real error or not, or simply ignore it. Most
developers know their own code and will quickly and efficiently determine the
root cause of a kmemcheck report. This is therefore also the most efficient
way to work with kmemcheck.
That said, we (the kmemcheck maintainers) will always be on the lookout for
false positives that we can annotate and silence. So whatever you find,
please drop us a note privately! Kernel configs and steps to reproduce (if
available) are of course a great help too.
Happy hacking!
Technical description
---------------------
kmemcheck works by marking memory pages non-present. This means that whenever
somebody attempts to access the page, a page fault is generated. The page
fault handler notices that the page was in fact only hidden, and so it calls
on the kmemcheck code to make further investigations.
When the investigations are completed, kmemcheck "shows" the page by marking
it present (as it would be under normal circumstances). This way, the
interrupted code can continue as usual.
But after the instruction has been executed, we should hide the page again, so
that we can catch the next access too! Now kmemcheck makes use of a debugging
feature of the processor, namely single-stepping. When the processor has
finished the one instruction that generated the memory access, a debug
exception is raised. From here, we simply hide the page again and continue
execution, this time with the single-stepping feature turned off.
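The idea can be illustrated with a rough user-space analogue (a sketch for
illustration only, not kmemcheck code; kmemcheck does this from the kernel's
page fault handler and additionally re-hides the page from the debug
exception)::

	#include <signal.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static char *page;
	static size_t page_size;

	/* "show" the page again so the faulting access can be restarted;
	 * mprotect() is not formally async-signal-safe, but it suffices
	 * for this illustration */
	static void on_fault(int sig, siginfo_t *si, void *ctx)
	{
		mprotect(page, page_size, PROT_READ | PROT_WRITE);
	}

	int main(void)
	{
		struct sigaction sa = { 0 };

		sa.sa_sigaction = on_fault;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGSEGV, &sa, NULL);

		page_size = getpagesize();
		page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (page == MAP_FAILED)
			return 1;

		mprotect(page, page_size, PROT_NONE);	/* "hide" the page */

		page[0] = 42;	/* faults once; the handler reveals the page */
		printf("access succeeded after the fault: %d\n", page[0]);
		return 0;
	}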
kmemcheck requires some assistance from the memory allocator in order to work.
The memory allocator needs to
1. Tell kmemcheck about newly allocated pages and pages that are about to
be freed. This allows kmemcheck to set up and tear down the shadow memory
for the pages in question. The shadow memory stores the status of each
byte in the allocation proper, e.g. whether it is initialized or
uninitialized.
2. Tell kmemcheck which parts of memory should be marked uninitialized.
There are actually a few more states, such as "not yet allocated" and
"recently freed".
If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
memory that can take page faults because of kmemcheck.
If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags.
This does not prevent the page faults from occurring, however; it merely
marks the object in question as initialized so that no warnings will ever be
produced for this object.
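A sketch of what this looks like at the call sites (the cache name and sizes
here are made up)::

	#include <linux/slab.h>

	static struct kmem_cache *example_cache;

	static int example_init(void)
	{
		void *p;

		/* objects from this cache are never tracked by kmemcheck */
		example_cache = kmem_cache_create("example_cache", 128, 0,
						  SLAB_NOTRACK, NULL);
		if (!example_cache)
			return -ENOMEM;

		/* a single allocation can also opt out of tracking */
		p = kmalloc(64, GFP_KERNEL | __GFP_NOTRACK);
		kfree(p);
		return 0;
	}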
Currently, the SLAB and SLUB allocators are supported by kmemcheck.
@@ -0,0 +1,210 @@
Kernel Memory Leak Detector
===========================
Kmemleak provides a way of detecting possible kernel memory leaks in a
way similar to a tracing garbage collector
(https://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29#Tracing_garbage_collectors),
with the difference that the orphan objects are not freed but only
reported via /sys/kernel/debug/kmemleak. A similar method is used by the
Valgrind tool (``memcheck --leak-check``) to detect the memory leaks in
user-space applications.
Kmemleak is supported on x86, arm, powerpc, sparc, sh, microblaze, ppc, mips, s390, metag and tile.
Usage
-----
CONFIG_DEBUG_KMEMLEAK in "Kernel hacking" has to be enabled. A kernel
thread scans the memory every 10 minutes (by default) and prints the
number of new unreferenced objects found. To display the details of all
the possible memory leaks::
# mount -t debugfs nodev /sys/kernel/debug/
# cat /sys/kernel/debug/kmemleak
To trigger an intermediate memory scan::
# echo scan > /sys/kernel/debug/kmemleak
To clear the list of all current possible memory leaks::
# echo clear > /sys/kernel/debug/kmemleak
New leaks will then come up upon reading ``/sys/kernel/debug/kmemleak``
again.
Note that the orphan objects are listed in the order they were allocated
and one object at the beginning of the list may cause other subsequent
objects to be reported as orphan.
Memory scanning parameters can be modified at run-time by writing to the
``/sys/kernel/debug/kmemleak`` file. The following parameters are supported:
- off
disable kmemleak (irreversible)
- stack=on
enable the task stacks scanning (default)
- stack=off
disable the tasks stacks scanning
- scan=on
start the automatic memory scanning thread (default)
- scan=off
stop the automatic memory scanning thread
- scan=<secs>
set the automatic memory scanning period in seconds
(default 600, 0 to stop the automatic scanning)
- scan
trigger a memory scan
- clear
clear list of current memory leak suspects, done by
marking all current reported unreferenced objects grey,
or free all kmemleak objects if kmemleak has been disabled.
- dump=<addr>
dump information about the object found at <addr>
Kmemleak can also be disabled at boot-time by passing ``kmemleak=off`` on
the kernel command line.
Memory may be allocated or freed before kmemleak is initialised and
these actions are stored in an early log buffer. The size of this buffer
is configured via the CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE option.
If CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is enabled, kmemleak is
disabled by default. Passing ``kmemleak=on`` on the kernel command
line enables it.
Basic Algorithm
---------------
The memory allocations via :c:func:`kmalloc`, :c:func:`vmalloc`,
:c:func:`kmem_cache_alloc` and
friends are traced and the pointers, together with additional
information like size and stack trace, are stored in a rbtree.
The corresponding freeing function calls are tracked and the pointers
removed from the kmemleak data structures.
An allocated block of memory is considered orphan if no pointer to its
start address or to any location inside the block can be found by
scanning the memory (including saved registers). This means that there
might be no way for the kernel to pass the address of the allocated
block to a freeing function and therefore the block is considered a
memory leak.
The scanning algorithm steps:
1. mark all objects as white (remaining white objects will later be
considered orphan)
2. scan the memory starting with the data section and stacks, checking
the values against the addresses stored in the rbtree. If
a pointer to a white object is found, the object is added to the
gray list
3. scan the gray objects for matching addresses (some white objects
can become gray and added at the end of the gray list) until the
gray set is finished
4. the remaining white objects are considered orphan and reported via
/sys/kernel/debug/kmemleak
Some allocated memory blocks have pointers stored in the kernel's
internal data structures and they cannot be detected as orphans. To
avoid this, kmemleak can also store the number of values pointing to an
address inside the block address range that need to be found so that the
block is not considered a leak. One example is __vmalloc().
Testing specific sections with kmemleak
---------------------------------------
Upon initial bootup your /sys/kernel/debug/kmemleak output page may be
quite extensive. This can also be the case if you have very buggy code
when doing development. To work around these situations you can use the
'clear' command to clear all reported unreferenced objects from the
/sys/kernel/debug/kmemleak output. By issuing a 'scan' after a 'clear'
you can find new unreferenced objects; this should help with testing
specific sections of code.
To test a critical section on demand with a clean kmemleak do::
# echo clear > /sys/kernel/debug/kmemleak
... test your kernel or modules ...
# echo scan > /sys/kernel/debug/kmemleak
Then get your report as usual with::
# cat /sys/kernel/debug/kmemleak
Freeing kmemleak internal objects
---------------------------------
To allow access to previously found memory leaks after kmemleak has been
disabled by the user or due to a fatal error, internal kmemleak objects
won't be freed when kmemleak is disabled, and those objects may occupy
a large part of physical memory.
In this situation, you may reclaim memory with::
# echo clear > /sys/kernel/debug/kmemleak
Kmemleak API
------------
See the include/linux/kmemleak.h header for the function prototypes.
- ``kmemleak_init`` - initialize kmemleak
- ``kmemleak_alloc`` - notify of a memory block allocation
- ``kmemleak_alloc_percpu`` - notify of a percpu memory block allocation
- ``kmemleak_free`` - notify of a memory block freeing
- ``kmemleak_free_part`` - notify of a partial memory block freeing
- ``kmemleak_free_percpu`` - notify of a percpu memory block freeing
- ``kmemleak_update_trace`` - update object allocation stack trace
- ``kmemleak_not_leak`` - mark an object as not a leak
- ``kmemleak_ignore`` - do not scan or report an object as leak
- ``kmemleak_scan_area`` - add scan areas inside a memory block
- ``kmemleak_no_scan`` - do not scan a memory block
- ``kmemleak_erase`` - erase an old value in a pointer variable
- ``kmemleak_alloc_recursive`` - as kmemleak_alloc but checks the recursiveness
- ``kmemleak_free_recursive`` - as kmemleak_free but checks the recursiveness
Dealing with false positives/negatives
--------------------------------------
The false negatives are real memory leaks (orphan objects) but not
reported by kmemleak because values found during the memory scanning
point to such objects. To reduce the number of false negatives, kmemleak
provides the kmemleak_ignore, kmemleak_scan_area, kmemleak_no_scan and
kmemleak_erase functions (see above). The task stacks also increase the
amount of false negatives and their scanning is not enabled by default.
The false positives are objects wrongly reported as being memory leaks
(orphan). For objects known not to be leaks, kmemleak provides the
kmemleak_not_leak function. kmemleak_ignore can also be used if
the memory block is known not to contain other pointers; it will then no
longer be scanned.
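As a minimal sketch (the device and register here are hypothetical, not taken
from a real driver): a buffer whose only reference is programmed into a
hardware register is invisible to the scanner and would otherwise be flagged::

	#include <linux/io.h>
	#include <linux/kernel.h>
	#include <linux/kmemleak.h>
	#include <linux/slab.h>

	static int example_setup(void __iomem *regs)
	{
		void *buf = kmalloc(64, GFP_KERNEL);

		if (!buf)
			return -ENOMEM;

		/* the device register ends up holding the only reference */
		writel(lower_32_bits(virt_to_phys(buf)), regs);

		/* kmemleak cannot see that reference; suppress the report */
		kmemleak_not_leak(buf);
		return 0;
	}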
Some of the reported leaks are only transient, especially on SMP
systems, because of pointers temporarily stored in CPU registers or
stacks. Kmemleak defines MSECS_MIN_AGE (defaulting to 1000) representing
the minimum age of an object to be reported as a memory leak.
Limitations and Drawbacks
-------------------------
The main drawback is the reduced performance of memory allocation and
freeing. To avoid other penalties, the memory scanning is only performed
when the /sys/kernel/debug/kmemleak file is read. Anyway, this tool is
intended for debugging purposes where the performance might not be the
most important requirement.
To keep the algorithm simple, kmemleak scans for values pointing to any
address inside a block's address range. This may lead to an increased
number of false negatives. However, it is likely that a real memory leak
will eventually become visible.
Another source of false negatives is the data stored in non-pointer
values. In a future version, kmemleak could scan only the pointer
members in the allocated structures. This feature would solve many of
the false negative cases described above.
The tool can report false positives. These are cases where an allocated
block doesn't need to be freed (some cases in the init_call functions),
the pointer is calculated by other methods than the usual container_of
macro or the pointer is stored in a location not scanned by kmemleak.
Page allocations and ioremap are not tracked.
@@ -0,0 +1,117 @@
.. Copyright 2004 Linus Torvalds
.. Copyright 2004 Pavel Machek <pavel@ucw.cz>
.. Copyright 2006 Bob Copeland <me@bobcopeland.com>
Sparse
======
Sparse is a semantic checker for C programs; it can be used to find a
number of potential problems with kernel code. See
https://lwn.net/Articles/689907/ for an overview of sparse; this document
contains some kernel-specific sparse information.
Using sparse for typechecking
-----------------------------
"__bitwise" is a type attribute, so you have to do something like this::
typedef int __bitwise pm_request_t;
enum pm_request {
PM_SUSPEND = (__force pm_request_t) 1,
PM_RESUME = (__force pm_request_t) 2
};
which makes PM_SUSPEND and PM_RESUME "bitwise" integers (the "__force" is
there because sparse will complain about casting to/from a bitwise type,
but in this case we really _do_ want to force the conversion). And because
the enum values are all the same type, now "enum pm_request" will be that
type too.
And with gcc, all the "__bitwise"/"__force" stuff goes away, and it all
ends up looking just like integers to gcc.
Quite frankly, you don't need the enum there. The above all really just
boils down to one special "int __bitwise" type.
So the simpler way is to just do::
typedef int __bitwise pm_request_t;
#define PM_SUSPEND ((__force pm_request_t) 1)
#define PM_RESUME ((__force pm_request_t) 2)
and you now have all the infrastructure needed for strict typechecking.
One small note: the constant integer "0" is special. You can use a
constant zero as a bitwise integer type without sparse ever complaining.
This is because "bitwise" (as the name implies) was designed for making
sure that bitwise types don't get mixed up (little-endian vs big-endian
vs cpu-endian vs whatever), and there the constant "0" really _is_
special.
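For instance (reusing the pm_request_t example from above), sparse should
accept the comparisons against 0 and against a value of the same bitwise
type, but warn about the comparison with a plain integer::

	typedef int __bitwise pm_request_t;

	#define PM_SUSPEND ((__force pm_request_t) 1)

	static int example(pm_request_t req)
	{
		if (req == 0)			/* OK: constant 0 is special */
			return 0;
		if (req == PM_SUSPEND)		/* OK: same bitwise type */
			return 1;
		return req == 1;		/* sparse warns: plain int vs. bitwise type */
	}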
__bitwise__ - to be used for relatively compact stuff (gfp_t, etc.) that
is mostly warning-free and is supposed to stay that way. Warnings will
be generated without __CHECK_ENDIAN__.
__bitwise - noisy stuff; in particular, __le*/__be* are that. We really
don't want to drown in noise unless we'd explicitly asked for it.
Using sparse for lock checking
------------------------------
The following macros are undefined for gcc and defined during a sparse
run to use the "context" tracking feature of sparse, applied to
locking. These annotations tell sparse when a lock is held, with
regard to the annotated function's entry and exit.
__must_hold - The specified lock is held on function entry and exit.
__acquires - The specified lock is held on function exit, but not entry.
__releases - The specified lock is held on function entry, but not exit.
If the function enters and exits without the lock held, acquiring and
releasing the lock inside the function in a balanced way, no
annotation is needed. The three annotations above are for cases where
sparse would otherwise report a context imbalance.
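A short sketch of how these are typically placed (the struct and function
names here are made up; the annotations themselves are the real macros)::

	#include <linux/spinlock.h>

	struct foo {
		spinlock_t lock;
		int counter;
	};

	static void foo_lock(struct foo *f) __acquires(&f->lock)
	{
		spin_lock(&f->lock);
	}

	static void foo_unlock(struct foo *f) __releases(&f->lock)
	{
		spin_unlock(&f->lock);
	}

	/* callers must already hold f->lock */
	static void foo_bump(struct foo *f) __must_hold(&f->lock)
	{
		f->counter++;
	}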
Getting sparse
--------------
You can get the latest released versions from the Sparse homepage at
https://sparse.wiki.kernel.org/index.php/Main_Page
Alternatively, you can get snapshots of the latest development version
of sparse using git to clone::
git://git.kernel.org/pub/scm/devel/sparse/sparse.git
DaveJ has hourly generated tarballs of the git tree available at::
http://www.codemonkey.org.uk/projects/git-snapshots/sparse/
Once you have it, just do::
make
make install
as a regular user, and it will install sparse in your ~/bin directory.
Using sparse
------------
Do a kernel make with "make C=1" to run sparse on all the C files that get
recompiled, or use "make C=2" to run sparse on the files whether they need to
be recompiled or not. The latter is a fast way to check the whole tree if you
have already built it.
The optional make variable CF can be used to pass arguments to sparse. The
build system passes -Wbitwise to sparse automatically. To perform endianness
checks, you may define __CHECK_ENDIAN__::
make C=2 CF="-D__CHECK_ENDIAN__"
These checks are disabled by default as they generate a host of warnings.
@@ -0,0 +1,25 @@
================================
Development tools for the kernel
================================
This document is a collection of documents about development tools that can
be used to work on the kernel. For now, the documents have been pulled
together without any significant effort to integrate them into a coherent
whole; patches welcome!
.. class:: toc-title
Table of contents
.. toctree::
:maxdepth: 2
coccinelle
sparse
kcov
gcov
kasan
ubsan
kmemleak
kmemcheck
gdb-kernel-debugging
@@ -0,0 +1,88 @@
The Undefined Behavior Sanitizer - UBSAN
========================================
UBSAN is a runtime undefined behaviour checker.
UBSAN uses compile-time instrumentation to catch undefined behavior (UB).
The compiler inserts code that performs certain kinds of checks before
operations that may cause UB. If a check fails (i.e. UB is detected), a
__ubsan_handle_* function is called to print an error message.
GCC has had this feature since 4.9.x [1_] (see the ``-fsanitize=undefined``
option and its suboptions). GCC 5.x has more checkers implemented [2_].
Report example
--------------
::
================================================================================
UBSAN: Undefined behaviour in ../include/linux/bitops.h:110:33
shift exponent 32 is to large for 32-bit type 'unsigned int'
CPU: 0 PID: 0 Comm: swapper Not tainted 4.4.0-rc1+ #26
0000000000000000 ffffffff82403cc8 ffffffff815e6cd6 0000000000000001
ffffffff82403cf8 ffffffff82403ce0 ffffffff8163a5ed 0000000000000020
ffffffff82403d78 ffffffff8163ac2b ffffffff815f0001 0000000000000002
Call Trace:
[<ffffffff815e6cd6>] dump_stack+0x45/0x5f
[<ffffffff8163a5ed>] ubsan_epilogue+0xd/0x40
[<ffffffff8163ac2b>] __ubsan_handle_shift_out_of_bounds+0xeb/0x130
[<ffffffff815f0001>] ? radix_tree_gang_lookup_slot+0x51/0x150
[<ffffffff8173c586>] _mix_pool_bytes+0x1e6/0x480
[<ffffffff83105653>] ? dmi_walk_early+0x48/0x5c
[<ffffffff8173c881>] add_device_randomness+0x61/0x130
[<ffffffff83105b35>] ? dmi_save_one_device+0xaa/0xaa
[<ffffffff83105653>] dmi_walk_early+0x48/0x5c
[<ffffffff831066ae>] dmi_scan_machine+0x278/0x4b4
[<ffffffff8111d58a>] ? vprintk_default+0x1a/0x20
[<ffffffff830ad120>] ? early_idt_handler_array+0x120/0x120
[<ffffffff830b2240>] setup_arch+0x405/0xc2c
[<ffffffff830ad120>] ? early_idt_handler_array+0x120/0x120
[<ffffffff830ae053>] start_kernel+0x83/0x49a
[<ffffffff830ad120>] ? early_idt_handler_array+0x120/0x120
[<ffffffff830ad386>] x86_64_start_reservations+0x2a/0x2c
[<ffffffff830ad4f3>] x86_64_start_kernel+0x16b/0x17a
================================================================================
Usage
-----
To enable UBSAN, configure the kernel with::
CONFIG_UBSAN=y
and to check the entire kernel::
CONFIG_UBSAN_SANITIZE_ALL=y
To enable instrumentation for specific files or directories, add a line
similar to the following to the respective kernel Makefile:
- For a single file (e.g. main.o)::
UBSAN_SANITIZE_main.o := y
- For all files in one directory::
UBSAN_SANITIZE := y
To exclude files from being instrumented even if
``CONFIG_UBSAN_SANITIZE_ALL=y``, use::
UBSAN_SANITIZE_main.o := n
and::
UBSAN_SANITIZE := n
Detection of unaligned accesses is controlled through a separate option,
CONFIG_UBSAN_ALIGNMENT. It is off by default on architectures that support
unaligned accesses (CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y). One could
still enable it in the config, just note that it will produce a lot of UBSAN
reports.
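For illustration only (this is not kernel code), an instrumented function
like the following would produce a shift-out-of-bounds report similar to the
example above when called with ``bits >= 32``::

	/* UBSAN_SANITIZE_example.o := y is assumed for this file */
	static unsigned int example_shift(unsigned int val, unsigned int bits)
	{
		/* undefined behaviour when bits >= 32 for a 32-bit type */
		return val << bits;
	}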
References
----------
.. _1: https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html
.. _2: https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html