123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184 |
- .. hwpoison:
- ========
- hwpoison
- ========
- What is hwpoison?
- =================
- Upcoming Intel CPUs have support for recovering from some memory errors
- (``MCA recovery``). This requires the OS to declare a page "poisoned",
- kill the processes associated with it and avoid using it in the future.
- This patchkit implements the necessary infrastructure in the VM.
- To quote the overview comment::
- High level machine check handler. Handles pages reported by the
- hardware as being corrupted usually due to a 2bit ECC memory or cache
- failure.
- This focusses on pages detected as corrupted in the background.
- When the current CPU tries to consume corruption the currently
- running process can just be killed directly instead. This implies
- that if the error cannot be handled for some reason it's safe to
- just ignore it because no corruption has been consumed yet. Instead
- when that happens another machine check will happen.
- Handles page cache pages in various states. The tricky part
- here is that we can access any page asynchronous to other VM
- users, because memory failures could happen anytime and anywhere,
- possibly violating some of their assumptions. This is why this code
- has to be extremely careful. Generally it tries to use normal locking
- rules, as in get the standard locks, even if that means the
- error handling takes potentially a long time.
- Some of the operations here are somewhat inefficient and have non
- linear algorithmic complexity, because the data structures have not
- been optimized for this case. This is in particular the case
- for the mapping from a vma to a process. Since this case is expected
- to be rare we hope we can get away with this.
- The code consists of a the high level handler in mm/memory-failure.c,
- a new page poison bit and various checks in the VM to handle poisoned
- pages.
- The main target right now is KVM guests, but it works for all kinds
- of applications. KVM support requires a recent qemu-kvm release.
- For the KVM use there was need for a new signal type so that
- KVM can inject the machine check into the guest with the proper
- address. This in theory allows other applications to handle
- memory failures too. The expection is that near all applications
- won't do that, but some very specialized ones might.
- Failure recovery modes
- ======================
- There are two (actually three) modes memory failure recovery can be in:
- vm.memory_failure_recovery sysctl set to zero:
- All memory failures cause a panic. Do not attempt recovery.
- early kill
- (can be controlled globally and per process)
- Send SIGBUS to the application as soon as the error is detected
- This allows applications who can process memory errors in a gentle
- way (e.g. drop affected object)
- This is the mode used by KVM qemu.
- late kill
- Send SIGBUS when the application runs into the corrupted page.
- This is best for memory error unaware applications and default
- Note some pages are always handled as late kill.
- User control
- ============
- vm.memory_failure_recovery
- See sysctl.txt
- vm.memory_failure_early_kill
- Enable early kill mode globally
- PR_MCE_KILL
- Set early/late kill mode/revert to system default
- arg1: PR_MCE_KILL_CLEAR:
- Revert to system default
- arg1: PR_MCE_KILL_SET:
- arg2 defines thread specific mode
- PR_MCE_KILL_EARLY:
- Early kill
- PR_MCE_KILL_LATE:
- Late kill
- PR_MCE_KILL_DEFAULT
- Use system global default
- Note that if you want to have a dedicated thread which handles
- the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
- call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
- the SIGBUS is sent to the main thread.
- PR_MCE_KILL_GET
- return current mode
- Testing
- =======
- * madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the
- process for testing
- * hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``
- corrupt-pfn
- Inject hwpoison fault at PFN echoed into this file. This does
- some early filtering to avoid corrupted unintended pages in test suites.
- unpoison-pfn
- Software-unpoison page at PFN echoed into this file. This way
- a page can be reused again. This only works for Linux
- injected failures, not for real memory failures. Once any hardware
- memory failure happens, this feature is disabled.
- Note these injection interfaces are not stable and might change between
- kernel versions
- corrupt-filter-dev-major, corrupt-filter-dev-minor
- Only handle memory failures to pages associated with the file
- system defined by block device major/minor. -1U is the
- wildcard value. This should be only used for testing with
- artificial injection.
- corrupt-filter-memcg
- Limit injection to pages owned by memgroup. Specified by inode
- number of the memcg.
- Example::
- mkdir /sys/fs/cgroup/mem/hwpoison
- usemem -m 100 -s 1000 &
- echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
- memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
- echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
- page-types -p `pidof init` --hwpoison # shall do nothing
- page-types -p `pidof usemem` --hwpoison # poison its pages
- corrupt-filter-flags-mask, corrupt-filter-flags-value
- When specified, only poison pages if ((page_flags & mask) ==
- value). This allows stress testing of many kinds of
- pages. The page_flags are the same as in /proc/kpageflags. The
- flag bits are defined in include/linux/kernel-page-flags.h and
- documented in Documentation/admin-guide/mm/pagemap.rst
- * Architecture specific MCE injector
- x86 has mce-inject, mce-test
- Some portable hwpoison test programs in mce-test, see below.
- References
- ==========
- http://halobates.de/mce-lc09-2.pdf
- Overview presentation from LinuxCon 09
- git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
- Test suite (hwpoison specific portable tests in tsrc)
- git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
- x86 specific injector
- Limitations
- ===========
- - Not all page types are supported and never will. Most kernel internal
- objects cannot be recovered, only LRU pages for now.
- ---
- Andi Kleen, Oct 2009
|