- ======================
- Firmware-Assisted Dump
- ======================
- July 2011
- The goal of firmware-assisted dump is to enable the dump of
- a crashed system, and to do so from a fully-reset system, and
- to minimize the total elapsed time until the system is back
- in production use.
- - Firmware-Assisted Dump (FADump) infrastructure is intended to replace
- the existing phyp assisted dump.
- - FADump uses the same firmware interfaces and memory reservation model
- as phyp assisted dump.
- - Unlike phyp dump, FADump exports the memory dump through /proc/vmcore
- in the ELF format in the same way as kdump. This helps us reuse the
- kdump infrastructure for dump capture and filtering.
- - Unlike phyp dump, userspace tools do not need to refer to any sysfs
- interface while reading /proc/vmcore.
- - Unlike phyp dump, FADump allows the user to release all the memory reserved
- for dump with a single operation: echo 1 > /sys/kernel/fadump_release_mem.
- - Once enabled through kernel boot parameter, FADump can be
- started/stopped through /sys/kernel/fadump_registered interface (see
- sysfs files section below) and can be easily integrated with kdump
- service start/stop init scripts.
- Compared with kdump or other strategies, firmware-assisted
- dump offers several strong, practical advantages:
- - Unlike kdump, the system has been reset, and loaded
- with a fresh copy of the kernel. In particular,
- PCI and I/O devices have been reinitialized and are
- in a clean, consistent state.
- - Once the dump is copied out, the memory that held the dump
- is immediately available to the running kernel. Therefore,
- unlike kdump, FADump doesn't need a second reboot to get the
- system back to the production configuration.
- The above can only be accomplished by coordination with,
- and assistance from the Power firmware. The procedure is
- as follows:
- - The first kernel registers the sections of memory with the
- Power firmware for dump preservation during OS initialization.
- These registered sections of memory are reserved by the first
- kernel during early boot.
- - When the system crashes, the Power firmware will copy the registered
- low memory regions (boot memory) from source to destination area.
- It will also save hardware PTEs.
- NOTE:
- The term 'boot memory' means the size of the low memory chunk
- that is required for a kernel to boot successfully when
- booted with restricted memory. By default, the boot memory
- size will be the larger of 5% of system RAM or 256MB.
- Alternatively, the user can specify the boot memory size
- through the boot parameter 'crashkernel=', which overrides
- the default calculated size. Use this option if the default
- boot memory size is not sufficient for the second kernel to
- boot successfully. For the syntax of the crashkernel= parameter,
- refer to Documentation/admin-guide/kdump/kdump.rst. If any
- offset is provided in the crashkernel= parameter, it will be
- ignored, as FADump uses a predefined offset to reserve memory
- for boot memory dump preservation in case of a crash.
- - After the low memory (boot memory) area has been saved, the
- firmware will reset PCI and other hardware state. It will
- *not* clear the RAM. It will then launch the bootloader, as
- normal.
- - The freshly booted kernel will notice that there is a new node
- (rtas/ibm,kernel-dump on pSeries or ibm,opal/dump/mpipl-boot
- on OPAL platform) in the device tree, indicating that
- there is crash data available from a previous boot. During
- early boot, the OS will reserve the rest of the memory above
- boot memory size, effectively booting with a restricted memory
- size. This ensures that this kernel (also referred
- to as the second kernel or capture kernel) will not touch any
- of the dump memory area.
- - User-space tools will read /proc/vmcore to obtain the contents
- of memory, which holds the previous crashed kernel dump in ELF
- format. The userspace tools may copy this info to disk, or
- network, NAS, SAN, iSCSI, etc. as desired.
- - Once the userspace tool is done saving the dump, it will echo
- '1' to /sys/kernel/fadump_release_mem to release the reserved
- memory back to general use, except the memory required for
- the next firmware-assisted dump registration.
- e.g.::
- # echo 1 > /sys/kernel/fadump_release_mem
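The default boot memory size described in the NOTE above (the larger of 5% of system RAM or 256MB) can be sketched in a few lines of shell. This is only an illustration of the stated rule, not the kernel's actual calculation; `default_boot_mem_kb` is a hypothetical helper name, and sizes are handled in kB for convenience.

```shell
#!/bin/sh
# Sketch of the rule stated in the NOTE: boot memory = max(5% of RAM, 256 MB).
# default_boot_mem_kb is a hypothetical helper, not a kernel interface.
# Usage: default_boot_mem_kb TOTAL_RAM_KB
default_boot_mem_kb() {
    total_kb=$1
    five_pct_kb=$((total_kb * 5 / 100))   # 5% of system RAM, in kB
    floor_kb=$((256 * 1024))              # 256 MB expressed in kB
    if [ "$five_pct_kb" -gt "$floor_kb" ]; then
        echo "$five_pct_kb"
    else
        echo "$floor_kb"
    fi
}
```

On a running system the input could be taken from the MemTotal line of /proc/meminfo, for example `default_boot_mem_kb "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"`. MemTotal is somewhat smaller than installed RAM, so the result is an approximation.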
- Please note that the firmware-assisted dump feature
- is only available on POWER6 and above systems on pSeries
- (PowerVM) platform and POWER9 and above systems with OP940
- or later firmware versions on PowerNV (OPAL) platform.
- Note that OPAL firmware exports the ibm,opal/dump node when
- FADump is supported on the PowerNV platform.
- On OPAL based machines, the system first boots into an intermediate
- kernel (referred to as petitboot kernel) before booting into the
- capture kernel. This kernel would have minimal kernel and/or
- userspace support to process crash data. Such a kernel needs to
- preserve the previously crashed kernel's memory for the subsequent
- capture kernel boot to process this crash data. Kernel config
- option CONFIG_PRESERVE_FA_DUMP has to be enabled on such a kernel
- to ensure that crash data is preserved to process later.
- -- On OPAL based machines (PowerNV), if the kernel is built with
- CONFIG_OPAL_CORE=y, OPAL memory at the time of crash is also
- exported as the /sys/firmware/opal/mpipl/core file. This sysfs file is
- helpful in debugging OPAL crashes with GDB. The kernel memory
- used for exporting this sysfs file can be released by echoing
- '1' to the /sys/firmware/opal/mpipl/release_core node.
- e.g.::
- # echo 1 > /sys/firmware/opal/mpipl/release_core
- Implementation details:
- -----------------------
- During boot, a check is made to see if firmware supports
- this feature on that particular machine. If it does, then
- we check to see if an active dump is waiting for us. If yes,
- then everything but boot memory size of RAM is reserved during
- early boot (see Fig. 2). This area is released once the userland
- scripts that are run (e.g. kdump scripts) finish collecting the
- dump. If there is dump data, then the
- /sys/kernel/fadump_release_mem file is created, and the reserved
- memory is held.
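A userspace script can make a similar check for waiting dump data by looking for the device tree entries named earlier in this document. The sketch below assumes the device tree is visible under /proc/device-tree; `fadump_dump_pending` is a hypothetical helper, and the optional argument exists only so the check can be pointed at another directory.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a device tree entry that signals crash data
# from a previous boot is present (names taken from this document).
# Usage: fadump_dump_pending [DT_ROOT]
fadump_dump_pending() {
    dt_root=${1:-/proc/device-tree}
    # pSeries: rtas/ibm,kernel-dump; OPAL: ibm,opal/dump/mpipl-boot
    [ -e "$dt_root/rtas/ibm,kernel-dump" ] ||
        [ -e "$dt_root/ibm,opal/dump/mpipl-boot" ]
}
```

For example, `fadump_dump_pending && echo "crash data from a previous boot is available"`.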
- If there is no waiting dump data, then only the memory required to
- hold CPU state, HPTE region, boot memory dump, FADump header and
- elfcore header is reserved, usually at an offset greater than boot
- memory size (see Fig. 1). This area is *not* released: this region
- will be kept permanently reserved, so that it can act as a receptacle
- for a copy of the boot memory content in addition to CPU state and
- HPTE region, in case a crash does occur.
- Since this reserved memory area is used only after the system crash,
- there is no point in blocking this significant chunk of memory from
- the production kernel. Hence, the implementation uses the Linux kernel's
- Contiguous Memory Allocator (CMA) for memory reservation if CMA is
- configured for the kernel. With CMA reservation, this memory is
- available for applications to use, while the kernel is prevented from
- using it. With this, FADump will still be able to capture all of the
- kernel memory and most of the user space memory, except the user pages
- that were present in the CMA region::
-   o Memory Reservation during first kernel
-   Low memory                                       Top of memory
-   0    boot memory size   |<--- Reserved dump area --->|       |
-   |           |           |    Permanent Reservation   |       |
-   V           V           |                            |       V
-   +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
-   |           |           |///|////|  DUMP | HDR | ELF |////|  |
-   +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
-         |                   ^    ^     ^      ^           ^
-         |                   |    |     |      |           |
-         \                  CPU  HPTE   /      |           |
-          ------------------------------       |           |
-       Boot memory content gets transferred    |           |
-       to reserved area by firmware at the     |           |
-       time of crash.                          |           |
-                                         FADump Header     |
-                                          (meta area)      |
-                                                           |
-                                                           |
-                       Metadata: This area holds a metadata structure whose
-                       address is registered with f/w and retrieved in the
-                       second kernel after crash, on platforms that support
-                       tags (OPAL). Having such structure with info needed
-                       to process the crashdump eases dump capture process.
-                      Fig. 1
-   o Memory Reservation during second kernel after crash
-   Low memory                                       Top of memory
-   0    boot memory size                                        |
-   |           |<------------ Crash preserved area ------------>|
-   V           V           |<--- Reserved dump area --->|       |
-   +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
-   |           |           |///|////|  DUMP | HDR | ELF |////|  |
-   +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
-         |                                           |
-         V                                           V
-    Used by second                             /proc/vmcore
-    kernel to boot
-         +---+
-         |///| -> Regions (CPU, HPTE & Metadata) marked like this in the above
-         +---+    figures are not always present. For example, OPAL platform
-                  does not have CPU & HPTE regions while Metadata region is
-                  not supported on pSeries currently.
-                      Fig. 2
- Currently the dump will be copied from /proc/vmcore to a new file upon
- user intervention. The dump data available through /proc/vmcore will be
- in ELF format. Hence the existing kdump infrastructure (kdump scripts)
- to save the dump works fine with minor modifications. Kdump scripts on
- major distro releases have already been modified to work seamlessly (no
- user intervention in saving the dump) when FADump is used, instead of
- kdump, as the dump mechanism.
- The tools to examine the dump will be the same as the ones
- used for kdump.
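The capture step described above (copy /proc/vmcore to a file, then release the reserved memory) can be sketched as follows. This is a minimal illustration, not what distro kdump scripts actually do; `save_vmcore` is a hypothetical helper, and the two optional path arguments exist only so the function can be exercised away from a real capture kernel.

```shell
#!/bin/sh
# Hypothetical helper: copy the ELF dump out, then release the reserved memory.
# Usage: save_vmcore DEST [VMCORE] [RELEASE_FILE]
save_vmcore() {
    dest=$1
    vmcore=${2:-/proc/vmcore}
    release=${3:-/sys/kernel/fadump_release_mem}
    if [ ! -r "$vmcore" ]; then
        echo "no dump available at $vmcore" >&2
        return 1
    fi
    # Release the memory only after the copy has succeeded.
    cp "$vmcore" "$dest" || return 1
    if [ -w "$release" ]; then
        echo 1 > "$release"
    fi
}
```

A call such as `save_vmcore /var/crash/vmcore` (the destination path is an assumption) would run in the capture kernel. Filtering or compressing the dump is left out of this sketch.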
- How to enable firmware-assisted dump (FADump):
- ----------------------------------------------
- 1. Set config option CONFIG_FA_DUMP=y and build the kernel.
- 2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option.
- By default, FADump reserved memory will be initialized as a CMA area.
- Alternatively, the user can boot the Linux kernel with 'fadump=nocma' to
- prevent FADump from using CMA.
- 3. Optionally, the user can also set the 'crashkernel=' kernel cmdline option
- to specify the size of the memory to reserve for boot memory dump
- preservation.
- NOTE:
- 1. The 'fadump_reserve_mem=' parameter has been deprecated. Instead
- use 'crashkernel=' to specify the size of the memory to reserve
- for boot memory dump preservation.
- 2. If firmware-assisted dump fails to reserve memory, then it
- will fall back to the existing kdump mechanism if the 'crashkernel='
- option is set on the kernel cmdline.
- 3. If the user wants to capture all of user space memory and is OK with
- the reserved memory not being available to the production system, then
- the 'fadump=nocma' kernel parameter can be used to fall back to
- the old behaviour.
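To confirm which of the options above a system was booted with, a script can inspect the kernel command line. The sketch below is an illustration only; `fadump_cmdline_mode` is a hypothetical helper, and treating the last fadump= occurrence as the effective one is an assumption made here.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the fadump= option found in a
# kernel command line string, or "unset" if the option is absent.
# Usage: fadump_cmdline_mode "CMDLINE"
fadump_cmdline_mode() {
    # One option per line, keep only fadump=..., take the last occurrence.
    mode=$(printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^fadump=//p' | tail -n 1)
    echo "${mode:-unset}"
}
```

On a running system this could be called as `fadump_cmdline_mode "$(cat /proc/cmdline)"`. The deprecated 'fadump_reserve_mem=' parameter is deliberately not matched.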
- Sysfs/debugfs files:
- --------------------
- The firmware-assisted dump feature uses the sysfs file system to hold
- the control files and a debugfs file to display the reserved memory regions.
- Here is the list of files under kernel sysfs:
- /sys/kernel/fadump_enabled
- This is used to display the FADump status.
- - 0 = FADump is disabled
- - 1 = FADump is enabled
- This interface can be used by kdump init scripts to identify if
- FADump is enabled in the kernel and act accordingly.
- /sys/kernel/fadump_registered
- This is used to display the FADump registration status as well
- as to control (start/stop) the FADump registration.
- - 0 = FADump is not registered.
- - 1 = FADump is registered and ready to handle system crash.
- To register FADump, echo 1 > /sys/kernel/fadump_registered; to
- un-register and stop FADump, echo 0 > /sys/kernel/fadump_registered.
- Once FADump is un-registered, a system crash will not
- be handled and vmcore will not be captured. This interface can be
- easily integrated with kdump service start/stop.
- /sys/kernel/fadump/mem_reserved
- This is used to display the memory reserved by FADump for saving the
- crash dump.
- /sys/kernel/fadump_release_mem
- This file is available only when FADump is active during the
- second kernel. This is used to release the reserved memory
- regions that are held for saving the crash dump. To release the
- reserved memory, echo 1 to it::
- echo 1 > /sys/kernel/fadump_release_mem
- After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
- file will change to reflect the new memory reservations.
- The existing userspace tools (kdump infrastructure) can be easily
- enhanced to use this interface to release the memory reserved for
- dump and continue without a second reboot.
- Note: /sys/kernel/fadump_release_opalcore sysfs has moved to
- /sys/firmware/opal/mpipl/release_core
- /sys/firmware/opal/mpipl/release_core
- This file is available only on OPAL based machines when FADump is
- active during the capture kernel. This is used to release the memory
- used by the kernel to export the /sys/firmware/opal/mpipl/core file. To
- release this memory, echo '1' to it::
- echo 1 > /sys/firmware/opal/mpipl/release_core
- Note: The following FADump sysfs files are deprecated.
- +----------------------------------+--------------------------------+
- | Deprecated | Alternative |
- +----------------------------------+--------------------------------+
- | /sys/kernel/fadump_enabled | /sys/kernel/fadump/enabled |
- +----------------------------------+--------------------------------+
- | /sys/kernel/fadump_registered | /sys/kernel/fadump/registered |
- +----------------------------------+--------------------------------+
- | /sys/kernel/fadump_release_mem | /sys/kernel/fadump/release_mem |
- +----------------------------------+--------------------------------+
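Scripts that must run on kernels with either naming scheme can prefer the alternative path from the table above and fall back to the deprecated one. The following is a minimal sketch under that assumption; `fadump_sysfs_file` is a hypothetical helper, and the optional base directory argument is only there to make the lookup testable.

```shell
#!/bin/sh
# Hypothetical helper: print the path of a FADump sysfs file, preferring the
# newer /sys/kernel/fadump/<name> layout over the deprecated fadump_<name>.
# Usage: fadump_sysfs_file NAME [BASE_DIR]     (NAME: enabled, registered, release_mem)
fadump_sysfs_file() {
    name=$1
    base=${2:-/sys/kernel}
    if [ -e "$base/fadump/$name" ]; then
        echo "$base/fadump/$name"
    elif [ -e "$base/fadump_$name" ]; then
        echo "$base/fadump_$name"
    else
        return 1    # neither layout is present
    fi
}
```

For example, `cat "$(fadump_sysfs_file enabled)"` would print 0 or 1 on a kernel that provides either file.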
- Here is the list of files under powerpc debugfs:
- (Assuming debugfs is mounted on /sys/kernel/debug directory.)
- /sys/kernel/debug/powerpc/fadump_region
- This file shows the reserved memory regions if FADump is
- enabled; otherwise this file is empty. The output format
- is::
- <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
- and for the kernel DUMP region it is::
- DUMP: Src: <src-addr>, Dest: <dest-addr>, Size: <size>, Dumped: # bytes
- e.g.
- Contents when FADump is registered during first kernel::
- # cat /sys/kernel/debug/powerpc/fadump_region
- CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
- HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
- DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
- Contents when FADump is active during second kernel::
- # cat /sys/kernel/debug/powerpc/fadump_region
- CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
- HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
- DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
- : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
- NOTE:
- Please refer to Documentation/filesystems/debugfs.rst on
- how to mount the debugfs filesystem.
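The fadump_region output format shown above is easy to post-process. As an illustration, the sketch below sums the "Dumped:" values across all regions; `fadump_total_dumped` is a hypothetical helper, and it assumes each value is printed in hexadecimal as in the sample output.

```shell
#!/bin/sh
# Hypothetical helper: print the total number of dumped bytes (decimal)
# reported across all regions of a fadump_region file.
# Usage: fadump_total_dumped [REGION_FILE]
fadump_total_dumped() {
    file=${1:-/sys/kernel/debug/powerpc/fadump_region}
    total=0
    # Extract the hex value that follows "Dumped:" on each region line.
    for hex in $(sed -n 's/.*Dumped: \(0x[0-9a-fA-F]*\).*/\1/p' "$file"); do
        total=$((total + $hex))
    done
    echo "$total"
}
```

With the first-kernel sample above the result is 0, since nothing has been dumped yet. In the second kernel it is the sum of all four "Dumped:" values.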
- TODO:
- -----
- - Need to come up with a better approach to find out a more
- accurate boot memory size that is required for a kernel to
- boot successfully when booted with restricted memory.
- - The FADump implementation introduces a FADump crash info structure
- in the scratch area before the ELF core header. The idea of introducing
- this structure is to pass some important crash info data to the second
- kernel, which will help the second kernel populate the ELF core header with
- correct data before it gets exported through /proc/vmcore. The current
- design does not address the possibility of introducing
- additional fields (in future) to this structure without affecting
- compatibility. Need to come up with a better approach to address this.
- The possible approaches are:
- 1. Introduce a version field for version tracking, and bump up the version
- whenever a new field is added to the structure in future. The version
- field can be used to find out what fields are valid for the current
- version of the structure.
- 2. Reserve an area of predefined size (say PAGE_SIZE) for this
- structure and have the unused area as reserved (initialized to zero)
- for future field additions.
- The advantage of approach 1 over 2 is that we don't need to reserve extra space.
- Author: Mahesh Salgaonkar <[email protected]>
- This document is based on the original documentation written for phyp
- assisted dump by Linas Vepstas and Manish Ahuja.