Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc merge from Benjamin Herrenschmidt: "Here's the powerpc batch for this merge window. It is going to be a bit more nasty than usual as in touching things outside of arch/powerpc mostly due to the big iSeriesectomy :-) We finally got rid of the bugger (legacy iSeries support) which was a PITA to maintain and that nobody really used anymore. Here are some of the highlights: - Legacy iSeries is gone. Thanks Stephen ! There's still some bits and pieces remaining if you do a grep -ir series arch/powerpc but they are harmless and will be removed in the next few weeks hopefully. - The 'fadump' functionality (Firmware Assisted Dump) replaces the previous (equivalent) "pHyp assisted dump"... it's a rewrite of a mechanism to get the hypervisor to do crash dumps on pSeries, the new implementation hopefully being much more reliable. Thanks Mahesh Salgaonkar. - The "EEH" code (pSeries PCI error handling & recovery) got a big spring cleaning, motivated by the need to be able to implement a new backend for it on top of some new different type of firwmare. The work isn't complete yet, but a good chunk of the cleanups is there. Note that this adds a field to struct device_node which is not very nice and which Grant objects to. I will have a patch soon that moves that to a powerpc private data structure (hopefully before rc1) and we'll improve things further later on (hopefully getting rid of the need for that pointer completely). Thanks Gavin Shan. - I dug into our exception & interrupt handling code to improve the way we do lazy interrupt handling (and make it work properly with "edge" triggered interrupt sources), and while at it found & fixed a wagon of issues in those areas, including adding support for page fault retry & fatal signals on page faults. - Your usual random batch of small fixes & updates, including a bunch of new embedded boards, both Freescale and APM based ones, etc..." I fixed up some conflicts with the generalized irq-domain changes from Grant Likely, hopefully correctly. * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (141 commits) powerpc/ps3: Do not adjust the wrapper load address powerpc: Remove the rest of the legacy iSeries include files powerpc: Remove the remaining CONFIG_PPC_ISERIES pieces init: Remove CONFIG_PPC_ISERIES powerpc: Remove FW_FEATURE ISERIES from arch code tty/hvc_vio: FW_FEATURE_ISERIES is no longer selectable powerpc/spufs: Fix double unlocks powerpc/5200: convert mpc5200 to use of_platform_populate() powerpc/mpc5200: add options to mpc5200_defconfig powerpc/mpc52xx: add a4m072 board support powerpc/mpc5200: update mpc5200_defconfig to fit for charon board Documentation/powerpc/mpc52xx.txt: Checkpatch cleanup powerpc/44x: Add additional device support for APM821xx SoC and Bluestone board powerpc/44x: Add support PCI-E for APM821xx SoC and Bluestone board MAINTAINERS: Update PowerPC 4xx tree powerpc/44x: The bug fixed support for APM821xx SoC and Bluestone board powerpc: document the FSL MPIC message register binding powerpc: add support for MPIC message register API powerpc/fsl: Added aliased MSIIR register address to MSI node in dts powerpc/85xx: mpc8548cds - add 36-bit dts ...
This commit is contained in:
63
Documentation/devicetree/bindings/powerpc/fsl/mpic-msgr.txt
Normal file
63
Documentation/devicetree/bindings/powerpc/fsl/mpic-msgr.txt
Normal file
@@ -0,0 +1,63 @@
|
||||
* FSL MPIC Message Registers
|
||||
|
||||
This binding specifies what properties must be available in the device tree
|
||||
representation of the message register blocks found in some FSL MPIC
|
||||
implementations.
|
||||
|
||||
Required properties:
|
||||
|
||||
- compatible: Specifies the compatibility list for the message register
|
||||
block. The type shall be <string-list> and the value shall be of the form
|
||||
"fsl,mpic-v<version>-msgr", where <version> is the version number of
|
||||
the MPIC containing the message registers.
|
||||
|
||||
- reg: Specifies the base physical address(s) and size(s) of the
|
||||
message register block's addressable register space. The type shall be
|
||||
<prop-encoded-array>.
|
||||
|
||||
- interrupts: Specifies a list of interrupt-specifiers which are available
|
||||
for receiving interrupts. Interrupt-specifier consists of two cells: first
|
||||
cell is interrupt-number and second cell is level-sense. The type shall be
|
||||
<prop-encoded-array>.
|
||||
|
||||
Optional properties:
|
||||
|
||||
- mpic-msgr-receive-mask: Specifies what registers in the containing block
|
||||
are allowed to receive interrupts. The value is a bit mask where a set
|
||||
bit at bit 'n' indicates that message register 'n' can receive interrupts.
|
||||
Note that "bit 'n'" is numbered from LSB for PPC hardware. The type shall
|
||||
be <u32>. If not present, then all of the message registers in the block
|
||||
are available.
|
||||
|
||||
Aliases:
|
||||
|
||||
An alias should be created for every message register block. They are not
|
||||
required, though. However, a particular implementation of this binding
|
||||
may require aliases to be present. Aliases are of the form
|
||||
'mpic-msgr-block<n>', where <n> is an integer specifying the block's number.
|
||||
Numbers shall start at 0.
|
||||
|
||||
Example:
|
||||
|
||||
aliases {
|
||||
mpic-msgr-block0 = &mpic_msgr_block0;
|
||||
mpic-msgr-block1 = &mpic_msgr_block1;
|
||||
};
|
||||
|
||||
mpic_msgr_block0: mpic-msgr-block@41400 {
|
||||
compatible = "fsl,mpic-v3.1-msgr";
|
||||
reg = <0x41400 0x200>;
|
||||
// Message registers 0 and 2 in this block can receive interrupts on
|
||||
// sources 0xb0 and 0xb2, respectively.
|
||||
interrupts = <0xb0 2 0xb2 2>;
|
||||
mpic-msgr-receive-mask = <0x5>;
|
||||
};
|
||||
|
||||
mpic_msgr_block1: mpic-msgr-block@42400 {
|
||||
compatible = "fsl,mpic-v3.1-msgr";
|
||||
reg = <0x42400 0x200>;
|
||||
// Message registers 0 and 2 in this block can receive interrupts on
|
||||
// sources 0xb4 and 0xb6, respectively.
|
||||
interrupts = <0xb4 2 0xb6 2>;
|
||||
mpic-msgr-receive-mask = <0x5>;
|
||||
};
|
@@ -56,7 +56,27 @@ PROPERTIES
|
||||
to the client. The presence of this property also mandates
|
||||
that any initialization related to interrupt sources shall
|
||||
be limited to sources explicitly referenced in the device tree.
|
||||
|
||||
|
||||
- big-endian
|
||||
Usage: optional
|
||||
Value type: <empty>
|
||||
If present the MPIC will be assumed to be big-endian. Some
|
||||
device-trees omit this property on MPIC nodes even when the MPIC is
|
||||
in fact big-endian, so certain boards override this property.
|
||||
|
||||
- single-cpu-affinity
|
||||
Usage: optional
|
||||
Value type: <empty>
|
||||
If present the MPIC will be assumed to only be able to route
|
||||
non-IPI interrupts to a single CPU at a time (EG: Freescale MPIC).
|
||||
|
||||
- last-interrupt-source
|
||||
Usage: optional
|
||||
Value type: <u32>
|
||||
Some MPICs do not correctly report the number of hardware sources
|
||||
in the global feature registers. If specified, this field will
|
||||
override the value read from MPIC_GREG_FEATURE_LAST_SRC.
|
||||
|
||||
INTERRUPT SPECIFIER DEFINITION
|
||||
|
||||
Interrupt specifiers consists of 4 cells encoded as
|
||||
|
@@ -6,8 +6,10 @@ Required properties:
|
||||
etc.) and the second is "fsl,mpic-msi" or "fsl,ipic-msi" depending on
|
||||
the parent type.
|
||||
|
||||
- reg : should contain the address and the length of the shared message
|
||||
interrupt register set.
|
||||
- reg : It may contain one or two regions. The first region should contain
|
||||
the address and the length of the shared message interrupt register set.
|
||||
The second region should contain the address of aliased MSIIR register for
|
||||
platforms that have such an alias.
|
||||
|
||||
- msi-available-ranges: use <start count> style section to define which
|
||||
msi interrupt can be used in the 256 msi interrupts. This property is
|
||||
|
270
Documentation/powerpc/firmware-assisted-dump.txt
Normal file
270
Documentation/powerpc/firmware-assisted-dump.txt
Normal file
@@ -0,0 +1,270 @@
|
||||
|
||||
Firmware-Assisted Dump
|
||||
------------------------
|
||||
July 2011
|
||||
|
||||
The goal of firmware-assisted dump is to enable the dump of
|
||||
a crashed system, and to do so from a fully-reset system, and
|
||||
to minimize the total elapsed time until the system is back
|
||||
in production use.
|
||||
|
||||
- Firmware assisted dump (fadump) infrastructure is intended to replace
|
||||
the existing phyp assisted dump.
|
||||
- Fadump uses the same firmware interfaces and memory reservation model
|
||||
as phyp assisted dump.
|
||||
- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
|
||||
in the ELF format in the same way as kdump. This helps us reuse the
|
||||
kdump infrastructure for dump capture and filtering.
|
||||
- Unlike phyp dump, userspace tool does not need to refer any sysfs
|
||||
interface while reading /proc/vmcore.
|
||||
- Unlike phyp dump, fadump allows user to release all the memory reserved
|
||||
for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem.
|
||||
- Once enabled through kernel boot parameter, fadump can be
|
||||
started/stopped through /sys/kernel/fadump_registered interface (see
|
||||
sysfs files section below) and can be easily integrated with kdump
|
||||
service start/stop init scripts.
|
||||
|
||||
Comparing with kdump or other strategies, firmware-assisted
|
||||
dump offers several strong, practical advantages:
|
||||
|
||||
-- Unlike kdump, the system has been reset, and loaded
|
||||
with a fresh copy of the kernel. In particular,
|
||||
PCI and I/O devices have been reinitialized and are
|
||||
in a clean, consistent state.
|
||||
-- Once the dump is copied out, the memory that held the dump
|
||||
is immediately available to the running kernel. And therefore,
|
||||
unlike kdump, fadump doesn't need a 2nd reboot to get back
|
||||
the system to the production configuration.
|
||||
|
||||
The above can only be accomplished by coordination with,
|
||||
and assistance from the Power firmware. The procedure is
|
||||
as follows:
|
||||
|
||||
-- The first kernel registers the sections of memory with the
|
||||
Power firmware for dump preservation during OS initialization.
|
||||
These registered sections of memory are reserved by the first
|
||||
kernel during early boot.
|
||||
|
||||
-- When a system crashes, the Power firmware will save
|
||||
the low memory (boot memory of size larger of 5% of system RAM
|
||||
or 256MB) of RAM to the previous registered region. It will
|
||||
also save system registers, and hardware PTE's.
|
||||
|
||||
NOTE: The term 'boot memory' means size of the low memory chunk
|
||||
that is required for a kernel to boot successfully when
|
||||
booted with restricted memory. By default, the boot memory
|
||||
size will be the larger of 5% of system RAM or 256MB.
|
||||
Alternatively, user can also specify boot memory size
|
||||
through boot parameter 'fadump_reserve_mem=' which will
|
||||
override the default calculated size. Use this option
|
||||
if default boot memory size is not sufficient for second
|
||||
kernel to boot successfully.
|
||||
|
||||
-- After the low memory (boot memory) area has been saved, the
|
||||
firmware will reset PCI and other hardware state. It will
|
||||
*not* clear the RAM. It will then launch the bootloader, as
|
||||
normal.
|
||||
|
||||
-- The freshly booted kernel will notice that there is a new
|
||||
node (ibm,dump-kernel) in the device tree, indicating that
|
||||
there is crash data available from a previous boot. During
|
||||
the early boot OS will reserve rest of the memory above
|
||||
boot memory size effectively booting with restricted memory
|
||||
size. This will make sure that the second kernel will not
|
||||
touch any of the dump memory area.
|
||||
|
||||
-- User-space tools will read /proc/vmcore to obtain the contents
|
||||
of memory, which holds the previous crashed kernel dump in ELF
|
||||
format. The userspace tools may copy this info to disk, or
|
||||
network, nas, san, iscsi, etc. as desired.
|
||||
|
||||
-- Once the userspace tool is done saving dump, it will echo
|
||||
'1' to /sys/kernel/fadump_release_mem to release the reserved
|
||||
memory back to general use, except the memory required for
|
||||
next firmware-assisted dump registration.
|
||||
|
||||
e.g.
|
||||
# echo 1 > /sys/kernel/fadump_release_mem
|
||||
|
||||
Please note that the firmware-assisted dump feature
|
||||
is only available on Power6 and above systems with recent
|
||||
firmware versions.
|
||||
|
||||
Implementation details:
|
||||
----------------------
|
||||
|
||||
During boot, a check is made to see if firmware supports
|
||||
this feature on that particular machine. If it does, then
|
||||
we check to see if an active dump is waiting for us. If yes
|
||||
then everything but boot memory size of RAM is reserved during
|
||||
early boot (See Fig. 2). This area is released once we finish
|
||||
collecting the dump from user land scripts (e.g. kdump scripts)
|
||||
that are run. If there is dump data, then the
|
||||
/sys/kernel/fadump_release_mem file is created, and the reserved
|
||||
memory is held.
|
||||
|
||||
If there is no waiting dump data, then only the memory required
|
||||
to hold CPU state, HPTE region, boot memory dump and elfcore
|
||||
header, is reserved at the top of memory (see Fig. 1). This area
|
||||
is *not* released: this region will be kept permanently reserved,
|
||||
so that it can act as a receptacle for a copy of the boot memory
|
||||
content in addition to CPU state and HPTE region, in the case a
|
||||
crash does occur.
|
||||
|
||||
o Memory Reservation during first kernel
|
||||
|
||||
Low memory Top of memory
|
||||
0 boot memory size |
|
||||
| | |<--Reserved dump area -->|
|
||||
V V | Permanent Reservation V
|
||||
+-----------+----------/ /----------+---+----+-----------+----+
|
||||
| | |CPU|HPTE| DUMP |ELF |
|
||||
+-----------+----------/ /----------+---+----+-----------+----+
|
||||
| ^
|
||||
| |
|
||||
\ /
|
||||
-------------------------------------------
|
||||
Boot memory content gets transferred to
|
||||
reserved area by firmware at the time of
|
||||
crash
|
||||
Fig. 1
|
||||
|
||||
o Memory Reservation during second kernel after crash
|
||||
|
||||
Low memory Top of memory
|
||||
0 boot memory size |
|
||||
| |<------------- Reserved dump area ----------- -->|
|
||||
V V V
|
||||
+-----------+----------/ /----------+---+----+-----------+----+
|
||||
| | |CPU|HPTE| DUMP |ELF |
|
||||
+-----------+----------/ /----------+---+----+-----------+----+
|
||||
| |
|
||||
V V
|
||||
Used by second /proc/vmcore
|
||||
kernel to boot
|
||||
Fig. 2
|
||||
|
||||
Currently the dump will be copied from /proc/vmcore to a
|
||||
a new file upon user intervention. The dump data available through
|
||||
/proc/vmcore will be in ELF format. Hence the existing kdump
|
||||
infrastructure (kdump scripts) to save the dump works fine with
|
||||
minor modifications.
|
||||
|
||||
The tools to examine the dump will be same as the ones
|
||||
used for kdump.
|
||||
|
||||
How to enable firmware-assisted dump (fadump):
|
||||
-------------------------------------
|
||||
|
||||
1. Set config option CONFIG_FA_DUMP=y and build kernel.
|
||||
2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
|
||||
3. Optionally, user can also set 'fadump_reserve_mem=' kernel cmdline
|
||||
to specify size of the memory to reserve for boot memory dump
|
||||
preservation.
|
||||
|
||||
NOTE: If firmware-assisted dump fails to reserve memory then it will
|
||||
fallback to existing kdump mechanism if 'crashkernel=' option
|
||||
is set at kernel cmdline.
|
||||
|
||||
Sysfs/debugfs files:
|
||||
------------
|
||||
|
||||
Firmware-assisted dump feature uses sysfs file system to hold
|
||||
the control files and debugfs file to display memory reserved region.
|
||||
|
||||
Here is the list of files under kernel sysfs:
|
||||
|
||||
/sys/kernel/fadump_enabled
|
||||
|
||||
This is used to display the fadump status.
|
||||
0 = fadump is disabled
|
||||
1 = fadump is enabled
|
||||
|
||||
This interface can be used by kdump init scripts to identify if
|
||||
fadump is enabled in the kernel and act accordingly.
|
||||
|
||||
/sys/kernel/fadump_registered
|
||||
|
||||
This is used to display the fadump registration status as well
|
||||
as to control (start/stop) the fadump registration.
|
||||
0 = fadump is not registered.
|
||||
1 = fadump is registered and ready to handle system crash.
|
||||
|
||||
To register fadump echo 1 > /sys/kernel/fadump_registered and
|
||||
echo 0 > /sys/kernel/fadump_registered for un-register and stop the
|
||||
fadump. Once the fadump is un-registered, the system crash will not
|
||||
be handled and vmcore will not be captured. This interface can be
|
||||
easily integrated with kdump service start/stop.
|
||||
|
||||
/sys/kernel/fadump_release_mem
|
||||
|
||||
This file is available only when fadump is active during
|
||||
second kernel. This is used to release the reserved memory
|
||||
region that are held for saving crash dump. To release the
|
||||
reserved memory echo 1 to it:
|
||||
|
||||
echo 1 > /sys/kernel/fadump_release_mem
|
||||
|
||||
After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
|
||||
file will change to reflect the new memory reservations.
|
||||
|
||||
The existing userspace tools (kdump infrastructure) can be easily
|
||||
enhanced to use this interface to release the memory reserved for
|
||||
dump and continue without 2nd reboot.
|
||||
|
||||
Here is the list of files under powerpc debugfs:
|
||||
(Assuming debugfs is mounted on /sys/kernel/debug directory.)
|
||||
|
||||
/sys/kernel/debug/powerpc/fadump_region
|
||||
|
||||
This file shows the reserved memory regions if fadump is
|
||||
enabled otherwise this file is empty. The output format
|
||||
is:
|
||||
<region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
|
||||
|
||||
e.g.
|
||||
Contents when fadump is registered during first kernel
|
||||
|
||||
# cat /sys/kernel/debug/powerpc/fadump_region
|
||||
CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
|
||||
HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
|
||||
DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
|
||||
|
||||
Contents when fadump is active during second kernel
|
||||
|
||||
# cat /sys/kernel/debug/powerpc/fadump_region
|
||||
CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
|
||||
HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
|
||||
DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
|
||||
: [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
|
||||
|
||||
NOTE: Please refer to Documentation/filesystems/debugfs.txt on
|
||||
how to mount the debugfs filesystem.
|
||||
|
||||
|
||||
TODO:
|
||||
-----
|
||||
o Need to come up with the better approach to find out more
|
||||
accurate boot memory size that is required for a kernel to
|
||||
boot successfully when booted with restricted memory.
|
||||
o The fadump implementation introduces a fadump crash info structure
|
||||
in the scratch area before the ELF core header. The idea of introducing
|
||||
this structure is to pass some important crash info data to the second
|
||||
kernel which will help second kernel to populate ELF core header with
|
||||
correct data before it gets exported through /proc/vmcore. The current
|
||||
design implementation does not address a possibility of introducing
|
||||
additional fields (in future) to this structure without affecting
|
||||
compatibility. Need to come up with the better approach to address this.
|
||||
The possible approaches are:
|
||||
1. Introduce version field for version tracking, bump up the version
|
||||
whenever a new field is added to the structure in future. The version
|
||||
field can be used to find out what fields are valid for the current
|
||||
version of the structure.
|
||||
2. Reserve the area of predefined size (say PAGE_SIZE) for this
|
||||
structure and have unused area as reserved (initialized to zero)
|
||||
for future field additions.
|
||||
The advantage of approach 1 over 2 is we don't need to reserve extra space.
|
||||
---
|
||||
Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
||||
This document is based on the original documentation written for phyp
|
||||
assisted dump by Linas Vepstas and Manish Ahuja.
|
@@ -2,7 +2,7 @@ Linux 2.6.x on MPC52xx family
|
||||
-----------------------------
|
||||
|
||||
For the latest info, go to http://www.246tNt.com/mpc52xx/
|
||||
|
||||
|
||||
To compile/use :
|
||||
|
||||
- U-Boot:
|
||||
@@ -10,23 +10,23 @@ To compile/use :
|
||||
if you wish to ).
|
||||
# make lite5200_defconfig
|
||||
# make uImage
|
||||
|
||||
|
||||
then, on U-boot:
|
||||
=> tftpboot 200000 uImage
|
||||
=> tftpboot 400000 pRamdisk
|
||||
=> bootm 200000 400000
|
||||
|
||||
|
||||
- DBug:
|
||||
# <edit Makefile to set ARCH=ppc & CROSS_COMPILE=... ( also EXTRAVERSION
|
||||
if you wish to ).
|
||||
# make lite5200_defconfig
|
||||
# cp your_initrd.gz arch/ppc/boot/images/ramdisk.image.gz
|
||||
# make zImage.initrd
|
||||
# make
|
||||
# make zImage.initrd
|
||||
# make
|
||||
|
||||
then in DBug:
|
||||
DBug> dn -i zImage.initrd.lite5200
|
||||
|
||||
|
||||
|
||||
Some remarks :
|
||||
- The port is named mpc52xxx, and config options are PPC_MPC52xx. The MGT5100
|
||||
|
@@ -1,127 +0,0 @@
|
||||
|
||||
Hypervisor-Assisted Dump
|
||||
------------------------
|
||||
November 2007
|
||||
|
||||
The goal of hypervisor-assisted dump is to enable the dump of
|
||||
a crashed system, and to do so from a fully-reset system, and
|
||||
to minimize the total elapsed time until the system is back
|
||||
in production use.
|
||||
|
||||
As compared to kdump or other strategies, hypervisor-assisted
|
||||
dump offers several strong, practical advantages:
|
||||
|
||||
-- Unlike kdump, the system has been reset, and loaded
|
||||
with a fresh copy of the kernel. In particular,
|
||||
PCI and I/O devices have been reinitialized and are
|
||||
in a clean, consistent state.
|
||||
-- As the dump is performed, the dumped memory becomes
|
||||
immediately available to the system for normal use.
|
||||
-- After the dump is completed, no further reboots are
|
||||
required; the system will be fully usable, and running
|
||||
in its normal, production mode on its normal kernel.
|
||||
|
||||
The above can only be accomplished by coordination with,
|
||||
and assistance from the hypervisor. The procedure is
|
||||
as follows:
|
||||
|
||||
-- When a system crashes, the hypervisor will save
|
||||
the low 256MB of RAM to a previously registered
|
||||
save region. It will also save system state, system
|
||||
registers, and hardware PTE's.
|
||||
|
||||
-- After the low 256MB area has been saved, the
|
||||
hypervisor will reset PCI and other hardware state.
|
||||
It will *not* clear RAM. It will then launch the
|
||||
bootloader, as normal.
|
||||
|
||||
-- The freshly booted kernel will notice that there
|
||||
is a new node (ibm,dump-kernel) in the device tree,
|
||||
indicating that there is crash data available from
|
||||
a previous boot. It will boot into only 256MB of RAM,
|
||||
reserving the rest of system memory.
|
||||
|
||||
-- Userspace tools will parse /sys/kernel/release_region
|
||||
and read /proc/vmcore to obtain the contents of memory,
|
||||
which holds the previous crashed kernel. The userspace
|
||||
tools may copy this info to disk, or network, nas, san,
|
||||
iscsi, etc. as desired.
|
||||
|
||||
For Example: the values in /sys/kernel/release-region
|
||||
would look something like this (address-range pairs).
|
||||
CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /
|
||||
DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A
|
||||
|
||||
-- As the userspace tools complete saving a portion of
|
||||
dump, they echo an offset and size to
|
||||
/sys/kernel/release_region to release the reserved
|
||||
memory back to general use.
|
||||
|
||||
An example of this is:
|
||||
"echo 0x40000000 0x10000000 > /sys/kernel/release_region"
|
||||
which will release 256MB at the 1GB boundary.
|
||||
|
||||
Please note that the hypervisor-assisted dump feature
|
||||
is only available on Power6-based systems with recent
|
||||
firmware versions.
|
||||
|
||||
Implementation details:
|
||||
----------------------
|
||||
|
||||
During boot, a check is made to see if firmware supports
|
||||
this feature on this particular machine. If it does, then
|
||||
we check to see if a active dump is waiting for us. If yes
|
||||
then everything but 256 MB of RAM is reserved during early
|
||||
boot. This area is released once we collect a dump from user
|
||||
land scripts that are run. If there is dump data, then
|
||||
the /sys/kernel/release_region file is created, and
|
||||
the reserved memory is held.
|
||||
|
||||
If there is no waiting dump data, then only the highest
|
||||
256MB of the ram is reserved as a scratch area. This area
|
||||
is *not* released: this region will be kept permanently
|
||||
reserved, so that it can act as a receptacle for a copy
|
||||
of the low 256MB in the case a crash does occur. See,
|
||||
however, "open issues" below, as to whether
|
||||
such a reserved region is really needed.
|
||||
|
||||
Currently the dump will be copied from /proc/vmcore to a
|
||||
a new file upon user intervention. The starting address
|
||||
to be read and the range for each data point in provided
|
||||
in /sys/kernel/release_region.
|
||||
|
||||
The tools to examine the dump will be same as the ones
|
||||
used for kdump.
|
||||
|
||||
General notes:
|
||||
--------------
|
||||
Security: please note that there are potential security issues
|
||||
with any sort of dump mechanism. In particular, plaintext
|
||||
(unencrypted) data, and possibly passwords, may be present in
|
||||
the dump data. Userspace tools must take adequate precautions to
|
||||
preserve security.
|
||||
|
||||
Open issues/ToDo:
|
||||
------------
|
||||
o The various code paths that tell the hypervisor that a crash
|
||||
occurred, vs. it simply being a normal reboot, should be
|
||||
reviewed, and possibly clarified/fixed.
|
||||
|
||||
o Instead of using /sys/kernel, should there be a /sys/dump
|
||||
instead? There is a dump_subsys being created by the s390 code,
|
||||
perhaps the pseries code should use a similar layout as well.
|
||||
|
||||
o Is reserving a 256MB region really required? The goal of
|
||||
reserving a 256MB scratch area is to make sure that no
|
||||
important crash data is clobbered when the hypervisor
|
||||
save low mem to the scratch area. But, if one could assure
|
||||
that nothing important is located in some 256MB area, then
|
||||
it would not need to be reserved. Something that can be
|
||||
improved in subsequent versions.
|
||||
|
||||
o Still working the kdump team to integrate this with kdump,
|
||||
some work remains but this would not affect the current
|
||||
patches.
|
||||
|
||||
o Still need to write a shell script, to copy the dump away.
|
||||
Currently I am parsing it manually.
|
Reference in New Issue
Block a user