The Corrected Machine Check structure (CMC) in HEST has a flag which can be
set by the firmware to indicate to the OS that it prefers to process the
corrected error events first. In this scenario, the OS is expected to not
monitor for corrected errors (through CMCI/polling). Instead, the firmware
notifies the OS on corrected error events through GHES.
Linux already has support for GHES. This patch adds support for parsing CMC
structure and to disable CMCI/polling if the firmware first flag is set.
Further, the list of machine check bank structures at the end of CMC is used
to determine which MCA banks function in FF mode, so that we continue to
monitor error events on the other banks.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pull timer core updates from Thomas Gleixner:
"The timer changes contain:
- posix timer code consolidation and fixes for odd corner cases
- sched_clock implementation moved from ARM to core code to avoid
duplication by other architectures
- alarm timer updates
- clocksource and clockevents unregistration facilities
- clocksource/events support for new hardware
- precise nanoseconds RTC readout (Xen feature)
- generic support for Xen suspend/resume oddities
- the usual lot of fixes and cleanups all over the place
The parts which touch other areas (ARM/XEN) have been coordinated with
the relevant maintainers. Though this results in an handful of
trivial to solve merge conflicts, which we preferred over nasty cross
tree merge dependencies.
The patches which have been committed in the last few days are bug
fixes plus the posix timer lot. The latter was in akpms queue and
next for quite some time; they just got forgotten and Frederic
collected them last minute."
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
hrtimer: Remove unused variable
hrtimers: Move SMP function call to thread context
clocksource: Reselect clocksource when watchdog validated high-res capability
posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting
posix_timers: fix racy timer delta caching on task exit
posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule()
selftests: add basic posix timers selftests
posix_cpu_timers: consolidate expired timers check
posix_cpu_timers: consolidate timer list cleanups
posix_cpu_timer: consolidate expiry time type
tick: Sanitize broadcast control logic
tick: Prevent uncontrolled switch to oneshot mode
tick: Make oneshot broadcast robust vs. CPU offlining
x86: xen: Sync the CMOS RTC as well as the Xen wallclock
x86: xen: Sync the wallclock when the system time is set
timekeeping: Indicate that clock was set in the pvclock gtod notifier
timekeeping: Pass flags instead of multiple bools to timekeeping_update()
xen: Remove clock_was_set() call in the resume path
hrtimers: Support resuming with two or more CPUs online (but stopped)
timer: Fix jiffies wrap behavior of round_jiffies_common()
...
Pull irqdomain refactoring from Grant Likely:
"This is the long awaited simplification of irqdomain. It gets rid of
the different types of irq domains and instead both linear and tree
mappings can be supported in a single domain. Doing this removes a
lot of special case code and makes irq domains simpler to understand
overall"
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux:
irq: fix checkpatch error
irqdomain: Include hwirq number in /proc/interrupts
irqdomain: make irq_linear_revmap() a fast path again
irqdomain: remove irq_domain_generate_simple()
irqdomain: Refactor irq_domain_associate_many()
irqdomain: Beef up debugfs output
irqdomain: Clean up aftermath of irq_domain refactoring
irqdomain: Eliminate revmap type
irqdomain: merge linear and tree reverse mappings.
irqdomain: Add a name field
irqdomain: Replace LEGACY mapping with LINEAR
irqdomain: Relax failure path on setting up mappings
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/core
Frederic sayed: "Most of these patches have been hanging around for
several month now, in -mmotm for a significant chunk. They already
missed a few releases."
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull trivial tree updates from Jiri Kosina:
"The usual stuff from trivial tree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
treewide: relase -> release
Documentation/cgroups/memory.txt: fix stat file documentation
sysctl/net.txt: delete reference to obsolete 2.4.x kernel
spinlock_api_smp.h: fix preprocessor comments
treewide: Fix typo in printk
doc: device tree: clarify stuff in usage-model.txt.
open firmware: "/aliasas" -> "/aliases"
md: bcache: Fixed a typo with the word 'arithmetic'
irq/generic-chip: fix a few kernel-doc entries
frv: Convert use of typedef ctl_table to struct ctl_table
sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
doc: clk: Fix incorrect wording
Documentation/arm/IXP4xx fix a typo
Documentation/networking/ieee802154 fix a typo
Documentation/DocBook/media/v4l fix a typo
Documentation/video4linux/si476x.txt fix a typo
Documentation/virtual/kvm/api.txt fix a typo
Documentation/early-userspace/README fix a typo
Documentation/video4linux/soc-camera.txt fix a typo
lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
...
Merge first patch-bomb from Andrew Morton:
- various misc bits
- I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been
distracted. There has been quite a bit of activity.
- About half the MM queue
- Some backlight bits
- Various lib/ updates
- checkpatch updates
- zillions more little rtc patches
- ptrace
- signals
- exec
- procfs
- rapidio
- nbd
- aoe
- pps
- memstick
- tools/testing/selftests updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (445 commits)
tools/testing/selftests: don't assume the x bit is set on scripts
selftests: add .gitignore for kcmp
selftests: fix clean target in kcmp Makefile
selftests: add .gitignore for vm
selftests: add hugetlbfstest
self-test: fix make clean
selftests: exit 1 on failure
kernel/resource.c: remove the unneeded assignment in function __find_resource
aio: fix wrong comment in aio_complete()
drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
drivers/memstick/host/r592.c: convert to module_pci_driver
drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
pps-gpio: add device-tree binding and support
drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
drivers/parport/share.c: use kzalloc
Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
aoe: update internal version number to v83
aoe: update copyright date
aoe: perform I/O completions in parallel
...
Pull power management and ACPI updates from Rafael Wysocki:
"This time the total number of ACPI commits is slightly greater than
the number of cpufreq commits, but Viresh Kumar (who works on cpufreq)
remains the most active patch submitter.
To me, the most significant change is the addition of offline/online
device operations to the driver core (with the Greg's blessing) and
the related modifications of the ACPI core hotplug code. Next are the
freezer updates from Colin Cross that should make the freezing of
tasks a bit less heavy weight.
We also have a couple of regression fixes, a number of fixes for
issues that have not been identified as regressions, two new drivers
and a bunch of cleanups all over.
Highlights:
- Hotplug changes to support graceful hot-removal failures.
It sometimes is necessary to fail device hot-removal operations
gracefully if they cannot be carried out completely. For example,
if memory from a memory module being hot-removed has been allocated
for the kernel's own use and cannot be moved elsewhere, it's
desirable to fail the hot-removal operation in a graceful way
rather than to crash the kernel, but currenty a success or a kernel
crash are the only possible outcomes of an attempted memory
hot-removal. Needless to say, that is not a very attractive
alternative and it had to be addressed.
However, in order to make it work for memory, I first had to make
it work for CPUs and for this purpose I needed to modify the ACPI
processor driver. It's been split into two parts, a resident one
handling the low-level initialization/cleanup and a modular one
playing the actual driver's role (but it binds to the CPU system
device objects rather than to the ACPI device objects representing
processors). That's been sort of like a live brain surgery on a
patient who's riding a bike.
So this is a little scary, but since we found and fixed a couple of
regressions it caused to happen during the early linux-next testing
(a month ago), nobody has complained.
As a bonus we remove some duplicated ACPI hotplug code, because the
ACPI-based CPU hotplug is now going to use the common ACPI hotplug
code.
- Lighter weight freezing of tasks.
These changes from Colin Cross and Mandeep Singh Baines are
targeted at making the freezing of tasks a bit less heavy weight
operation. They reduce the number of tasks woken up every time
during the freezing, by using the observation that the freezer
simply doesn't need to wake up some of them and wait for them all
to call refrigerator(). The time needed for the freezer to decide
to report a failure is reduced too.
Also reintroduced is the check causing a lockdep warining to
trigger when try_to_freeze() is called with locks held (which is
generally unsafe and shouldn't happen).
- cpufreq updates
First off, a commit from Srivatsa S Bhat fixes a resume regression
introduced during the 3.10 cycle causing some cpufreq sysfs
attributes to return wrong values to user space after resume. The
fix is kind of fresh, but also it's pretty obvious once Srivatsa
has identified the root cause.
Second, we have a new freqdomain_cpus sysfs attribute for the
acpi-cpufreq driver to provide information previously available via
related_cpus. From Lan Tianyu.
Finally, we fix a number of issues, mostly related to the
CPUFREQ_POSTCHANGE notifier and cpufreq Kconfig options and clean
up some code. The majority of changes from Viresh Kumar with bits
from Jacob Shin, Heiko Stübner, Xiaoguang Chen, Ezequiel Garcia,
Arnd Bergmann, and Tang Yuantian.
- ACPICA update
A usual bunch of updates from the ACPICA upstream.
During the 3.4 cycle we introduced support for ACPI 5 extended
sleep registers, but they are only supposed to be used if the
HW-reduced mode bit is set in the FADT flags and the code attempted
to use them without checking that bit. That caused suspend/resume
regressions to happen on some systems. Fix from Lv Zheng causes
those registers to be used only if the HW-reduced mode bit is set.
Apart from this some other ACPICA bugs are fixed and code cleanups
are made by Bob Moore, Tomasz Nowicki, Lv Zheng, Chao Guan, and
Zhang Rui.
- cpuidle updates
New driver for Xilinx Zynq processors is added by Michal Simek.
Multidriver support simplification, addition of some missing
kerneldoc comments and Kconfig-related fixes come from Daniel
Lezcano.
- ACPI power management updates
Changes to make suspend/resume work correctly in Xen guests from
Konrad Rzeszutek Wilk, sparse warning fix from Fengguang Wu and
cleanups and fixes of the ACPI device power state selection
routine.
- ACPI documentation updates
Some previously missing pieces of ACPI documentation are added by
Lv Zheng and Aaron Lu (hopefully, that will help people to
uderstand how the ACPI subsystem works) and one outdated doc is
updated by Hanjun Guo.
- Assorted ACPI updates
We finally nailed down the IA-64 issue that was the reason for
reverting commit 9f29ab11dd ("ACPI / scan: do not match drivers
against objects having scan handlers"), so we can fix it and move
the ACPI scan handler check added to the ACPI video driver back to
the core.
A mechanism for adding CMOS RTC address space handlers is
introduced by Lan Tianyu to allow some EC-related breakage to be
fixed on some systems.
A spec-compliant implementation of acpi_os_get_timer() is added by
Mika Westerberg.
The evaluation of _STA is added to do_acpi_find_child() to avoid
situations in which a pointer to a disabled device object is
returned instead of an enabled one with the same _ADR value. From
Jeff Wu.
Intel BayTrail PCH (Platform Controller Hub) support is added to
the ACPI driver for Intel Low-Power Subsystems (LPSS) and that
driver is modified to work around a couple of known BIOS issues.
Changes from Mika Westerberg and Heikki Krogerus.
The EC driver is fixed by Vasiliy Kulikov to use get_user() and
put_user() instead of dereferencing user space pointers blindly.
Code cleanups are made by Bjorn Helgaas, Nicholas Mazzuca and Toshi
Kani.
- Assorted power management updates
The "runtime idle" helper routine is changed to take the return
values of the callbacks executed by it into account and to call
rpm_suspend() if they return 0, which allows us to reduce the
overall code bloat a bit (by dropping some code that's not
necessary any more after that modification).
The runtime PM documentation is updated by Alan Stern (to reflect
the "runtime idle" behavior change).
New trace points for PM QoS are added by Sahara
(<keun-o.park@windriver.com>).
PM QoS documentation is updated by Lan Tianyu.
Code cleanups are made and minor issues are addressed by Bernie
Thompson, Bjorn Helgaas, Julius Werner, and Shuah Khan.
- devfreq updates
New driver for the Exynos5-bus device from Abhilash Kesavan.
Minor cleanups, fixes and MAINTAINERS update from MyungJoo Ham,
Abhilash Kesavan, Paul Bolle, Rajagopal Venkat, and Wei Yongjun.
- OMAP power management updates
Adaptive Voltage Scaling (AVS) SmartReflex voltage control driver
updates from Andrii Tseglytskyi and Nishanth Menon."
* tag 'pm+acpi-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (162 commits)
cpufreq: Fix cpufreq regression after suspend/resume
ACPI / PM: Fix possible NULL pointer deref in acpi_pm_device_sleep_state()
PM / Sleep: Warn about system time after resume with pm_trace
cpufreq: don't leave stale policy pointer in cdbs->cur_policy
acpi-cpufreq: Add new sysfs attribute freqdomain_cpus
cpufreq: make sure frequency transitions are serialized
ACPI: implement acpi_os_get_timer() according the spec
ACPI / EC: Add HP Folio 13 to ec_dmi_table in order to skip DSDT scan
ACPI: Add CMOS RTC Operation Region handler support
ACPI / processor: Drop unused variable from processor_perflib.c
cpufreq: tegra: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: s3c64xx: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: omap: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: imx6q: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: exynos: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: dbx500: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: davinci: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: arm-big-little: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: powernow-k8: call CPUFREQ_POSTCHANGE notfier in error cases
cpufreq: pcc: call CPUFREQ_POSTCHANGE notfier in error cases
...
Pull thermal power-limit update from Tony Luck:
"Thermal limit warnings are too scary and cause unnecessary concern"
* tag 'please-pull-mce-therm' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
x86 thermal: Disable power limit notification interrupt by default
x86 thermal: Delete power-limit-notification console messages
Pull x86 UV update from Ingo Molnar:
"There's a single commit in this tree, which adds support for a new SGI
UV GRU (Global Reference Unit - fast NUMA messaging ASIC) hardware
feature to scale up and beyond: an optional distributed mode that will
allow per-node address mapping of local GRU space, as opposed to
mapping all GRU hardware to the same contiguous high space"
* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/UV: Add GRU distributed mode mappings
Pull x86 tracing updates from Ingo Molnar:
"This tree adds IRQ vector tracepoints that are named after the handler
and which output the vector #, based on a zero-overhead approach that
relies on changing the IDT entries, by Seiji Aguchi.
The new tracepoints look like this:
# perf list | grep -i irq_vector
irq_vectors:local_timer_entry [Tracepoint event]
irq_vectors:local_timer_exit [Tracepoint event]
irq_vectors:reschedule_entry [Tracepoint event]
irq_vectors:reschedule_exit [Tracepoint event]
irq_vectors:spurious_apic_entry [Tracepoint event]
irq_vectors:spurious_apic_exit [Tracepoint event]
irq_vectors:error_apic_entry [Tracepoint event]
irq_vectors:error_apic_exit [Tracepoint event]
[...]"
* 'x86-tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tracing: Add config option checking to the definitions of mce handlers
trace,x86: Do not call local_irq_save() in load_current_idt()
trace,x86: Move creation of irq tracepoints from apic.c to irq.c
x86, trace: Add irq vector tracepoints
x86: Rename variables for debugging
x86, trace: Introduce entering/exiting_irq()
tracing: Add DEFINE_EVENT_FN() macro
Pull x86 RAS update from Ingo Molnar:
"The changes in this tree are:
- ACPI APEI (ACPI Platform Error Interface) improvements, by Chen
Gong
- misc MCE fixes/cleanups"
* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Update MCE severity condition check
mce: acpi/apei: Add comments to clarify usage of the various bitfields in the MCA subsystem
ACPI/APEI: Update einj documentation for param1/param2
ACPI/APEI: Add parameter check before error injection
ACPI, APEI, EINJ: Fix error return code in einj_init()
x86, mce: Fix "braodcast" typo
Pull x86 mm changes from Ingo Molnar:
"Misc improvements:
- Fix /proc/mtrr reporting
- Fix ioremap printout
- Remove the unused pvclock fixmap entry on 32-bit
- misc cleanups"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioremap: Correct function name output
x86: Fix /proc/mtrr with base/size more than 44bits
ix86: Don't waste fixmap entries
x86/mm: Drop unneeded include <asm/*pgtable, page*_types.h>
x86_64: Correct phys_addr in cleanup_highmap comment
Pull x86 microcode loading update from Ingo Molnar:
"Two main changes that improve microcode loading on AMD CPUs:
- Add support for all-in-one binary microcode files that concatenate
the microcode images of multiple processor families, by Jacob Shin
- Add early microcode loading (embedded in the initrd) support, also
by Jacob Shin"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode, amd: Another early loading fixup
x86, microcode, amd: Allow multiple families' bin files appended together
x86, microcode, amd: Make find_ucode_in_initrd() __init
x86, microcode, amd: Fix warnings and errors on with CONFIG_MICROCODE=m
x86, microcode, amd: Early microcode patch loading support for AMD
x86, microcode, amd: Refactor functions to prepare for early loading
x86, microcode: Vendor abstract out save_microcode_in_initrd()
x86, microcode, intel: Correct typo in printk
Pull x86 FPU changes from Ingo Molnar:
"There are two bigger changes in this tree:
- Add an [early-use-]safe static_cpu_has() variant and other
robustness improvements, including the new X86_DEBUG_STATIC_CPU_HAS
configurable debugging facility, motivated by recent obscure FPU
code bugs, by Borislav Petkov
- Reimplement FPU detection code in C and drop the old asm code, by
Peter Anvin."
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, fpu: Use static_cpu_has_safe before alternatives
x86: Add a static_cpu_has_safe variant
x86: Sanity-check static_cpu_has usage
x86, cpu: Add a synthetic, always true, cpu feature
x86: Get rid of ->hard_math and all the FPU asm fu
Pull x86 debug update from Ingo Molnar:
"Misc debuggability improvements:
- Optimize the x86 CPU register printout a bit
- Expose the tboot TXT log via debugfs
- Small do_debug() cleanup"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tboot: Provide debugfs interfaces to access TXT log
x86: Remove weird PTR_ERR() in do_debug
x86/debug: Only print out DR registers if they are not power-on defaults
Pull x86 cpu updates from Ingo Molnar:
"Two changes:
- Extend 32-bit double fault debugging aid to 64-bit
- Fix a build warning"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel/cacheinfo: Shut up last long-standing warning
x86: Extend #DF debugging aid to 64-bit
Pull x86 cleanups from Ingo Molnar:
"Misc x86 cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, reloc: Use xorl instead of xorq in relocate_kernel_64.S
x86, cleanups: Remove extra tab in __flush_tlb_one()
x86/mce: Remove check for CONFIG_X86_MCE_P4THERMAL
Pull asm/x86 changes from Ingo Molnar:
"Misc changes, with a bigger processor-flags cleanup/reorganization by
Peter Anvin"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, asm, cleanup: Replace open-coded control register values with symbolic
x86, processor-flags: Fix the datatypes and add bit number defines
x86: Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE
x86, flags: Rename X86_EFLAGS_BIT1 to X86_EFLAGS_FIXED
linux/const.h: Add _BITUL() and _BITULL()
x86/vdso: Convert use of typedef ctl_table to struct ctl_table
x86: __force_order doesn't need to be an actual variable
Pull perf updates from Ingo Molnar:
"Kernel improvements:
- watchdog driver improvements by Li Zefan
- Power7 CPI stack events related improvements by Sukadev Bhattiprolu
- event multiplexing via hrtimers and other improvements by Stephane
Eranian
- kernel stack use optimization by Andrew Hunter
- AMD IOMMU uncore PMU support by Suravee Suthikulpanit
- NMI handling rate-limits by Dave Hansen
- various hw_breakpoint fixes by Oleg Nesterov
- hw_breakpoint overflow period sampling and related signal handling
fixes by Jiri Olsa
- Intel Haswell PMU support by Andi Kleen
Tooling improvements:
- Reset SIGTERM handler in workload child process, fix from David
Ahern.
- Makefile reorganization, prep work for Kconfig patches, from Jiri
Olsa.
- Add automated make test suite, from Jiri Olsa.
- Add --percent-limit option to 'top' and 'report', from Namhyung
Kim.
- Sorting improvements, from Namhyung Kim.
- Expand definition of sysfs format attribute, from Michael Ellerman.
Tooling fixes:
- 'perf tests' fixes from Jiri Olsa.
- Make Power7 CPI stack events available in sysfs, from Sukadev
Bhattiprolu.
- Handle death by SIGTERM in 'perf record', fix from David Ahern.
- Fix printing of perf_event_paranoid message, from David Ahern.
- Handle realloc failures in 'perf kvm', from David Ahern.
- Fix divide by 0 in variance, from David Ahern.
- Save parent pid in thread struct, from David Ahern.
- Handle JITed code in shared memory, from Andi Kleen.
- Fixes for 'perf diff', from Jiri Olsa.
- Remove some unused struct members, from Jiri Olsa.
- Add missing liblk.a dependency for python/perf.so, fix from Jiri
Olsa.
- Respect CROSS_COMPILE in liblk.a, from Rabin Vincent.
- No need to do locking when adding hists in perf report, only 'top'
needs that, from Namhyung Kim.
- Fix alignment of symbol column in in the hists browser (top,
report) when -v is given, from NAmhyung Kim.
- Fix 'perf top' -E option behavior, from Namhyung Kim.
- Fix bug in isupper() and islower(), from Sukadev Bhattiprolu.
- Fix compile errors in bp_signal 'perf test', from Sukadev
Bhattiprolu.
... and more things"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (102 commits)
perf/x86: Disable PEBS-LL in intel_pmu_pebs_disable()
perf/x86: Fix shared register mutual exclusion enforcement
perf/x86/intel: Support full width counting
x86: Add NMI duration tracepoints
perf: Drop sample rate when sampling is too slow
x86: Warn when NMI handlers take large amounts of time
hw_breakpoint: Introduce "struct bp_cpuinfo"
hw_breakpoint: Simplify *register_wide_hw_breakpoint()
hw_breakpoint: Introduce cpumask_of_bp()
hw_breakpoint: Simplify the "weight" usage in toggle_bp_slot() paths
hw_breakpoint: Simplify list/idx mess in toggle_bp_slot() paths
perf/x86/intel: Add mem-loads/stores support for Haswell
perf/x86/intel: Support Haswell/v4 LBR format
perf/x86/intel: Move NMI clearing to end of PMI handler
perf/x86/intel: Add Haswell PEBS support
perf/x86/intel: Add simple Haswell PMU support
perf/x86/intel: Add Haswell PEBS record support
perf/x86/intel: Fix sparse warning
perf/x86/amd: AMD IOMMU Performance Counter PERF uncore PMU implementation
perf/x86/amd: Add IOMMU Performance Counter resource management
...
Reschedule vector tracepoints may be called in cpu idle state.
This causes lockdep check warning below.
The tracepoint requires rcu but for accuracy it also
requires irq_enter() (tracepoints record the irq context), thus,
the tracepoint interrupt handler should be calling irq_enter()
and not rcu_irq_enter() (irq_enter() calls rcu_irq_enter()).
So, add irq_enter/exit() to smp_trace_reschedule_interrupt()
with common pre/post processing functions, smp_entering_irq()
and exiting_irq() (exiting_irq() calls just irq_exit()
in arch/x86/include/asm/apic.h),
because these can be shared among reschedule, call_function,
and call_function_single vectors.
[ 50.720557] Testing event reschedule_exit:
[ 50.721349]
[ 50.721502] ===============================
[ 50.721835] [ INFO: suspicious RCU usage. ]
[ 50.722169] 3.10.0-rc6-00004-gcf910e8 #190 Not tainted
[ 50.722582] -------------------------------
[ 50.722915] /c/kernel-tests/src/linux/arch/x86/include/asm/trace/irq_vectors.h:50 suspicious rcu_dereference_check() usage!
[ 50.723770]
[ 50.723770] other info that might help us debug this:
[ 50.723770]
[ 50.724385]
[ 50.724385] RCU used illegally from idle CPU!
[ 50.724385] rcu_scheduler_active = 1, debug_locks = 0
[ 50.725232] RCU used illegally from extended quiescent state!
[ 50.725690] no locks held by swapper/0/0.
[ 50.726010]
[ 50.726010] stack backtrace:
[...]
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/51CDCFA3.9080101@hds.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* acpi-pm:
ACPI / PM: Rework and clean up acpi_dev_pm_get_state()
ACPI / PM: Replace ACPI_STATE_D3 with ACPI_STATE_D3_COLD in device_pm.c
ACPI / PM: Rename function acpi_device_power_state() and make it static
ACPI / PM: acpi_processor_suspend() can be static
xen / ACPI / sleep: Register an acpi_suspend_lowlevel callback.
x86 / ACPI / sleep: Provide registration for acpi_suspend_lowlevel.
Pull MCE cleanup from Tony Luck:
"Changes to simplify the SDM means that we can also simplify
the code for SRAR (software recoverable action required) errors."
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linux 3.10-rc7
The sdvo lvds fix in this -fixes pull
commit c3456fb3e4
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Mon Jun 10 09:47:58 2013 +0200
drm/i915: prefer VBT modes for SVDO-LVDS over EDID
has a silent functional conflict with
commit 990256aec2
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date: Fri May 31 12:17:07 2013 +0000
drm: Add probed modes in probe order
in drm-next. W simply need to add the vbt modes before edid modes, i.e. the
other way round than now.
Conflicts:
drivers/gpu/drm/drm_prime.c
drivers/gpu/drm/i915/intel_sdvo.c
This patch fixes a problem with the shared registers mutual
exclusion code and incremental event scheduling by the
generic perf_event code.
There was a bug whereby the mutual exclusion on the shared
registers was not enforced because of incremental scheduling
abort due to event constraints. As an example on Intel
Nehalem, consider the following events:
group1= L1D_CACHE_LD:E_STATE,OFFCORE_RESPONSE_0:PF_RFO,L1D_CACHE_LD:I_STATE
group2= L1D_CACHE_LD:I_STATE
The L1D_CACHE_LD event can only be measured by 2 counters. Yet, there
are 3 instances here. The first group can be scheduled and is committed.
Then, the generic code tries to schedule group2 and this fails (because
there is no more counter to support the 3rd instance of L1D_CACHE_LD).
But in x86_schedule_events() error path, put_event_contraints() is invoked
on ALL the events and not just the ones that just failed. That causes the
"lock" on the shared offcore_response MSR to be released. Yet the first group
is actually scheduled and is exposed to reprogramming of that shared msr by
the sibling HT thread. In other words, there is no guarantee on what is
measured.
This patch fixes the problem by tagging committed events with the
PERF_X86_EVENT_COMMITTED tag. In the error path of x86_schedule_events(),
only the events NOT tagged have their constraint released. The tag
is eventually removed when the event in descheduled.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130620164254.GA3556@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf fixes from Ingo Molnar:
"Three small fixlets"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot()
hw_breakpoint: Fix cpu check in task_bp_pinned(cpu)
kprobes: Fix arch_prepare_kprobe to handle copy insn failures
Recent Intel CPUs like Haswell and IvyBridge have a new
alternative MSR range for perfctrs that allows writing the full
counter width. Enable this range if the hardware reports it
using a new capability bit.
Currently the perf code queries CPUID to get the counter width,
and sign extends the counter values as needed. The traditional
PERFCTR MSRs always limit to 32bit, even though the counter
internally is larger (usually 48 bits on recent CPUs)
When the new capability is set use the alternative range which
do not have these restrictions.
This lowers the overhead of perf stat slightly because it has to
do less interrupts to accumulate the counter value. On Haswell
it also avoids some problems with TSX aborting when the end of
the counter range is reached.
( See the patch "perf/x86/intel: Avoid checkpointed counters
causing excessive TSX aborts" for more details. )
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1372173153-20215-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull MCE updates from Tony Luck:
"Better comments so we understand our existing machine check
bank bitmaps - prelude to adding another bitmap soon."
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is some confusion about the 'mce_poll_banks' and 'mce_banks_owned'
per-cpu bitmaps. Provide comments so that we all know exactly what these
are used for, and why.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
On one sytem that mtrr range is more then 44bits, in dmesg we have
[ 0.000000] MTRR default type: write-back
[ 0.000000] MTRR fixed ranges enabled:
[ 0.000000] 00000-9FFFF write-back
[ 0.000000] A0000-BFFFF uncachable
[ 0.000000] C0000-DFFFF write-through
[ 0.000000] E0000-FFFFF write-protect
[ 0.000000] MTRR variable ranges enabled:
[ 0.000000] 0 [000080000000-0000FFFFFFFF] mask 3FFF80000000 uncachable
[ 0.000000] 1 [380000000000-38FFFFFFFFFF] mask 3F0000000000 uncachable
[ 0.000000] 2 [000099000000-000099FFFFFF] mask 3FFFFF000000 write-through
[ 0.000000] 3 [00009A000000-00009AFFFFFF] mask 3FFFFF000000 write-through
[ 0.000000] 4 [381FFA000000-381FFBFFFFFF] mask 3FFFFE000000 write-through
[ 0.000000] 5 [381FFC000000-381FFC0FFFFF] mask 3FFFFFF00000 write-through
[ 0.000000] 6 [0000AD000000-0000ADFFFFFF] mask 3FFFFF000000 write-through
[ 0.000000] 7 [0000BD000000-0000BDFFFFFF] mask 3FFFFF000000 write-through
[ 0.000000] 8 disabled
[ 0.000000] 9 disabled
but /proc/mtrr report wrong:
reg00: base=0x080000000 ( 2048MB), size= 2048MB, count=1: uncachable
reg01: base=0x80000000000 (8388608MB), size=1048576MB, count=1: uncachable
reg02: base=0x099000000 ( 2448MB), size= 16MB, count=1: write-through
reg03: base=0x09a000000 ( 2464MB), size= 16MB, count=1: write-through
reg04: base=0x81ffa000000 (8519584MB), size= 32MB, count=1: write-through
reg05: base=0x81ffc000000 (8519616MB), size= 1MB, count=1: write-through
reg06: base=0x0ad000000 ( 2768MB), size= 16MB, count=1: write-through
reg07: base=0x0bd000000 ( 3024MB), size= 16MB, count=1: write-through
reg08: base=0x09b000000 ( 2480MB), size= 16MB, count=1: write-combining
so bit 44 and bit 45 get cut off.
We have problems in arch/x86/kernel/cpu/mtrr/generic.c::generic_get_mtrr().
1. for base, we miss cast base_lo to 64bit before shifting.
Fix that by adding u64 casting.
2. for size, it only can handle 44 bits aka 32bits + page_shift
Fix that with 64bit mask instead of 32bit mask_lo, then range could be
more than 44bits.
At the same time, we need to update size_or_mask for old cpus that does
support cpuid 0x80000008 to get phys_addr. Need to set high 32bits
to all 1s, otherwise will not get correct size for them.
Also fix mtrr_add_page: it should check base and (base + size - 1)
instead of base and size, as base and size could be small but
base + size could bigger enough to be out of boundary. We can
use boot_cpu_data.x86_phys_bits directly to avoid size_or_mask.
So When are we going to have size more than 44bits? that is 16TiB.
after patch we have right ouput:
reg00: base=0x080000000 ( 2048MB), size= 2048MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=1048576MB, count=1: uncachable
reg02: base=0x099000000 ( 2448MB), size= 16MB, count=1: write-through
reg03: base=0x09a000000 ( 2464MB), size= 16MB, count=1: write-through
reg04: base=0x381ffa000000 (58851232MB), size= 32MB, count=1: write-through
reg05: base=0x381ffc000000 (58851264MB), size= 1MB, count=1: write-through
reg06: base=0x0ad000000 ( 2768MB), size= 16MB, count=1: write-through
reg07: base=0x0bd000000 ( 3024MB), size= 16MB, count=1: write-through
reg08: base=0x09b000000 ( 2480MB), size= 16MB, count=1: write-combining
-v2: simply checking in mtrr_add_page according to hpa.
[ hpa: This probably wants to go into -stable only after having sat in
mainline for a bit. It is not a regression. ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371162815-29931-1-git-send-email-yinghai@kernel.org
Cc: <stable@vger.kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Originally, irq_domain_associate_many() was designed to unwind the
mapped irqs on a failure of any individual association. However, that
proved to be a problem with certain IRQ controllers. Some of them only
support a subset of irqs, and will fail when attempting to map a
reserved IRQ. In those cases we want to map as many IRQs as possible, so
instead it is better for irq_domain_associate_many() to make a
best-effort attempt to map irqs, but not fail if any or all of them
don't succeed. If a caller really cares about how many irqs got
associated, then it should instead go back and check that all of the
irqs is cares about were mapped.
The original design open-coded the individual association code into the
body of irq_domain_associate_many(), but with no longer needing to
unwind associations, the code becomes simpler to split out
irq_domain_associate() to contain the bulk of the logic, and
irq_domain_associate_many() to be a simple loop wrapper.
This patch also adds a new error check to the associate path to make
sure it isn't called for an irq larger than the controller can handle,
and adds locking so that the irq_domain_mutex is held while setting up a
new association.
v3: Fixup missing change to irq_domain_add_tree()
v2: Fixup x86 warning. irq_domain_associate_many() no longer returns an
error code, but reports errors to the printk log directly. In the
majority of cases we don't actually want to fail if there is a
problem, but rather log it and still try to boot the system.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
irqdomain: Fix flubbed irq_domain_associate_many refactoring
commit d39046ec72, "irqdomain: Refactor irq_domain_associate_many()" was
missing the following hunk which causes a boot failure on anything using
irq_domain_add_tree() to allocate an irq domain.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Cc: Thomas Gleixner <tglx@linutronix.de>,
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
This patch has been invaluable in my adventures finding
issues in the perf NMI handler. I'm as big a fan of
printk() as anybody is, but using printk() in NMIs is
deadly when they're happening frequently.
Even hacking in trace_printk() ended up eating enough
CPU to throw off some of the measurements I was making.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Cc: Dave Hansen <dave@sr71.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch keeps track of how long perf's NMI handler is taking,
and also calculates how many samples perf can take a second. If
the sample length times the expected max number of samples
exceeds a configurable threshold, it drops the sample rate.
This way, we don't have a runaway sampling process eating up the
CPU.
This patch can tend to drop the sample rate down to level where
perf doesn't work very well. *BUT* the alternative is that my
system hangs because it spends all of its time handling NMIs.
I'll take a busted performance tool over an entire system that's
busted and undebuggable any day.
BTW, my suspicion is that there's still an underlying bug here.
Using the HPET instead of the TSC is definitely a contributing
factor, but I suspect there are some other things going on.
But, I can't go dig down on a bug like that with my machine
hanging all the time.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Cc: Dave Hansen <dave@sr71.net>
[ Prettified it a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I have a system which is causing all kinds of problems. It has
8 NUMA nodes, and lots of cores that can fight over cachelines.
If things are not working _perfectly_, then NMIs can take longer
than expected.
If we get too many of them backed up to each other, we can
easily end up in a situation where we are doing nothing *but*
running NMIs. The biggest problem, though, is that this happens
_silently_. You might be lucky to get an hrtimer warning, but
most of the time system simply hangs.
This patch should at least give us some warning before we fall
off the cliff. the warnings look like this:
nmi_handle: perf_event_nmi_handler() took: 26095071 ns
The message is triggered whenever we notice the longest NMI
we've seen to date. You can always view and reset this value
via the debugfs interface if you like.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Cc: Dave Hansen <dave@sr71.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In case CONFIG_X86_MCE_THRESHOLD and CONFIG_X86_THERMAL_VECTOR
are disabled, kernel build fails as follows.
arch/x86/built-in.o: In function `trace_threshold_interrupt':
(.entry.text+0x122b): undefined reference to `smp_trace_threshold_interrupt'
arch/x86/built-in.o: In function `trace_thermal_interrupt':
(.entry.text+0x132b): undefined reference to `smp_trace_thermal_interrupt'
In this case, trace_threshold_interrupt/trace_thermal_interrupt
are not needed to define.
So, add config option checking to their definitions in entry_64.S.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/51C58B8A.2080808@hds.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As load_current_idt() is now what is used to update the IDT for the
switches needed for NMI, lockdep debug, and for tracing, it must not
call local_irq_save(). This is because one of the users of this is
lockdep, which does tracing of local_irq_save() and when the debug
trap is hit, we need to update the IDT before tracing interrupts
being disabled. As load_current_idt() is used to do this, calling
local_irq_save() which lockdep traces, defeats the point of calling
load_current_idt().
As interrupts are already disabled when used by lockdep and NMI, the
only other user is tracing that can disable interrupts itself. Simply
have the tracing update disable interrupts before calling load_current_idt()
instead of breaking the other users.
Here's the dump that happened:
------------[ cut here ]------------
WARNING: at /work/autotest/nobackup/linux-test.git/kernel/fork.c:1196 copy_process+0x2c3/0x1398()
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled)
Modules linked in:
CPU: 1 PID: 4570 Comm: gdm-simple-gree Not tainted 3.10.0-rc3-test+ #5
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
ffffffff81d2a7a5 ffff88006ed13d50 ffffffff8192822b ffff88006ed13d90
ffffffff81035f25 ffff8800721c6000 ffff88006ed13da0 0000000001200011
0000000000000000 ffff88006ed5e000 ffff8800721c6000 ffff88006ed13df0
Call Trace:
[<ffffffff8192822b>] dump_stack+0x19/0x1b
[<ffffffff81035f25>] warn_slowpath_common+0x67/0x80
[<ffffffff81035fe1>] warn_slowpath_fmt+0x46/0x48
[<ffffffff812bfc5d>] ? __raw_spin_lock_init+0x31/0x52
[<ffffffff810341f7>] copy_process+0x2c3/0x1398
[<ffffffff8103539d>] do_fork+0xa8/0x260
[<ffffffff810ca7b1>] ? trace_preempt_on+0x2a/0x2f
[<ffffffff812afb3e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81937fe7>] ? sysret_check+0x1b/0x56
[<ffffffff81937fe7>] ? sysret_check+0x1b/0x56
[<ffffffff810355cf>] SyS_clone+0x16/0x18
[<ffffffff81938369>] stub_clone+0x69/0x90
[<ffffffff81937fc2>] ? system_call_fastpath+0x16/0x1b
---[ end trace 8b157a9d20ca1aa2 ]---
in fork.c:
#ifdef CONFIG_PROVE_LOCKING
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled); <-- bug here
DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull x86 fixes from Peter Anvin:
"This series fixes a couple of build failures, and fixes MTRR cleanup
and memory setup on very specific memory maps.
Finally, it fixes triggering backtraces on all CPUs, which was
inadvertently disabled on x86."
* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/efi: Fix dummy variable buffer allocation
x86: Fix trigger_all_cpu_backtrace() implementation
x86: Fix section mismatch on load_ucode_ap
x86: fix build error and kconfig for ia32_emulation and binfmt
range: Do not add new blank slot with add_range_with_merge
x86, mtrr: Fix original mtrr range get for mtrr_cleanup
Pull KVM fixes from Paolo Bonzini:
"Three one-line fixes for my first pull request; one for x86 host, one
for x86 guest, one for PPC"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
x86: kvmclock: zero initialize pvclock shared memory area
kvm/ppc/booke: Delay kvmppc_lazy_ee_enable
KVM: x86: remove vcpu's CPL check in host-invoked XCR set