The Intel Software Developer Manual describes bit 24 in the MCG_CAP
MSR:
MCG_SER_P (software error recovery support present) flag,
bit 24 — Indicates (when set) that the processor supports
software error recovery
But only some models with this capability bit set will actually
generate recoverable machine checks.
Check the model name and set a synthetic capability bit. Provide
a command line option to set this bit anyway in case the kernel
doesn't recognise the model name.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/2e5bfb23c89800a036fb8a45fa97a74bb16bc362.1455732970.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Huge amounts of help from Andy Lutomirski and Borislav Petkov to
produce this. Andy provided the inspiration to add classes to the
exception table with a clever bit-squeezing trick, Boris pointed
out how much cleaner it would all be if we just had a new field.
Linus Torvalds blessed the expansion with:
' I'd rather not be clever in order to save just a tiny amount of space
in the exception table, which isn't really criticial for anybody. '
The third field is another relative function pointer, this one to a
handler that executes the actions.
We start out with three handlers:
1: Legacy - just jumps the to fixup IP
2: Fault - provide the trap number in %ax to the fixup code
3: Cleaned up legacy for the uaccess error hack
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f6af78fcbd348cf4939875cfda9c19689b5e50b8.1455732970.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a second attempt to make the improvements from c6f2062935
("x86/signal/64: Fix SS handling for signals delivered to 64-bit
programs"), which was reverted by 51adbfbba5c6 ("x86/signal/64: Add
support for SS in the 64-bit signal context").
This adds two new uc_flags flags. UC_SIGCONTEXT_SS will be set for
all 64-bit signals (including x32). It indicates that the saved SS
field is valid and that the kernel supports the new behavior.
The goal is to fix a problems with signal handling in 64-bit tasks:
SS wasn't saved in the 64-bit signal context, making it awkward to
determine what SS was at the time of signal delivery and making it
impossible to return to a non-flat SS (as calling sigreturn clobbers
SS).
This also made it extremely difficult for 64-bit tasks to return to
fully-defined 16-bit contexts, because only the kernel can easily do
espfix64, but sigreturn was unable to set a non-flag SS:ESP.
(DOSEMU has a monstrous hack to partially work around this
limitation.)
If we could go back in time, the correct fix would be to make 64-bit
signals work just like 32-bit signals with respect to SS: save it
in signal context, reset it when delivering a signal, and restore
it in sigreturn.
Unfortunately, doing that (as I tried originally) breaks DOSEMU:
DOSEMU wouldn't reset the signal context's SS when clearing the LDT
and changing the saved CS to 64-bit mode, since it predates the SS
context field existing in the first place.
This patch is a bit more complicated, and it tries to balance a
bunch of goals. It makes most cases of changing ucontext->ss during
signal handling work as expected.
I do this by special-casing the interesting case. On sigreturn,
ucontext->ss will be honored by default, unless the ucontext was
created from scratch by an old program and had a 64-bit CS
(unfortunately, CRIU can do this) or was the result of changing a
32-bit signal context to 64-bit without resetting SS (as DOSEMU
does).
For the benefit of new 64-bit software that uses segmentation (new
versions of DOSEMU might), the new behavior can be detected with a
new ucontext flag UC_SIGCONTEXT_SS.
To avoid compilation issues, __pad0 is left as an alias for ss in
ucontext.
The nitty-gritty details are documented in the header file.
This patch also re-enables the sigreturn_64 and ldt_gdt_64 selftests,
as the kernel change allows both of them to pass.
Tested-by: Stas Sergeev <stsp@list.ru>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/749149cbfc3e75cd7fcdad69a854b399d792cc6f.1455664054.git.luto@kernel.org
[ Small readability edit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signals are always delivered to 64-bit tasks with CS set to a long
mode segment. In long mode, SS doesn't matter as long as it's a
present writable segment.
If SS starts out invalid (this can happen if the signal was caused
by an IRET fault or was delivered on the way out of set_thread_area
or modify_ldt), then IRET to the signal handler can fail, eventually
killing the task.
The straightforward fix would be to simply reset SS when delivering
a signal. That breaks DOSEMU, though: 64-bit builds of DOSEMU rely
on SS being set to the faulting SS when signals are delivered.
As a compromise, this patch leaves SS alone so long as it's valid.
The net effect should be that the behavior of successfully delivered
signals is unchanged. Some signals that would previously have
failed to be delivered will now be delivered successfully.
This has no effect for x32 or 32-bit tasks: their signal handlers
were already called with SS == __USER_DS.
(On Xen, there's a slight hole: if a task sets SS to a writable
*kernel* data segment, then we will fail to identify it as invalid
and we'll still kill the task. If anyone cares, this could be fixed
with a new paravirt hook.)
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stas Sergeev <stsp@list.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/163c6e1eacde41388f3ff4d2fe6769be651d7b6e.1455664054.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add an fwnode_handle to the x86 struct pci_sysdata, which will be used to
locate an IRQ domain associated with a root PCI bus.
[bhelgaas: changelog]
Signed-off-by: Jake Oshins <jakeo@microsoft.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
There are two CPUID bits for protection keys. One is for whether
the CPU contains the feature, and the other will appear set once
the OS enables protection keys. Specifically:
Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable
Protection keys (and the RDPKRU/WRPKRU instructions)
This is because userspace can not see CR4 contents, but it can
see CPUID contents.
X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:
CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3]
X86_FEATURE_OSPKE is "OSPKU":
CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]
These are the first CPU features which need to look at the
ECX word in CPUID leaf 0x7, so this patch also includes
fetching that word in to the cpuinfo->x86_capability[] array.
Add it to the disabled-features mask when its config option is
off. Even though we are not using it here, we also extend the
REQUIRED_MASK_BIT_SET() macro to keep it mirroring the
DISABLED_MASK_BIT_SET() version.
This means that in almost all code, you should use:
cpu_has(c, X86_FEATURE_PKU)
and *not* the CONFIG option.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210201.7714C250@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is an XSAVE state component for Intel Processor Trace (PT).
But, we do not currently use it.
We add a placeholder in the code for it so it is not a mystery and
also so we do not need an explicit enum initialization for Protection
Keys in a moment.
Why don't we use it?
We might end up using this at _some_ point in the future. But,
this is a "system" state which requires using the currently
unsupported XSAVES feature. Unlike all the other XSAVE states,
PT state is also not directly tied to a thread. You might
context-switch between threads, but not want to change any of the
PT state. Or, you might switch between threads, and *do* want to
change PT state, all depending on what is being traced.
We currently just manually set some MSRs to do this PT context
switching, and it is unclear whether replacing our direct MSR use
with XSAVE will be a net win or loss, both in code complexity and
performance.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: fenghua.yu@intel.com
Cc: linux-mm@kvack.org
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20160212210158.5E4BCAE2@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When GCC cannot do constant folding for this macro, it falls back to
cpu_has(). But static_cpu_has() is optimal and it works at all times
now. So use it and speedup the fallback case.
Before we had this:
mov 0x99d674(%rip),%rdx # ffffffff81b0d9f4 <boot_cpu_data+0x34>
shr $0x2e,%rdx
and $0x1,%edx
jne ffffffff811704e9 <do_munmap+0x3f9>
After alternatives patching, it turns into:
jmp 0xffffffff81170390
nopl (%rax)
...
callq ffffffff81056e00 <mpx_notify_unmap>
ffffffff81170390: mov 0x170(%r12),%rdi
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1455578358-28347-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
asm/gpio.h is included only by linux/gpio.h, and then only when the arch
selects ARCH_HAVE_CUSTOM_GPIO_H. Only the following arches select it: arm
avr32 blackfin m68k (COLDFIRE only) sh unicore32.
Remove the unused asm/gpio.h files for the arches that do not select
ARCH_HAVE_CUSTOM_GPIO_H.
This is a follow-on to 7563bbf89d ("gpiolib/arches: Centralise
bolierplate asm/gpio.h").
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
The messages should be different depending on the type of error.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
DMI cacheability is very confused on x86.
dmi_early_remap() uses early_ioremap(), which uses FIXMAP_PAGE_IO,
which is __PAGE_KERNEL_IO, which is __PAGE_KERNEL, which is cached.
Don't ask me why this makes any sense.
dmi_remap() uses ioremap(), which requests an uncached mapping.
However, on non-EFI systems, the DMI data generally lives between
0xf0000 and 0x100000, which is in the legacy ISA range, which
triggers a special case in the PAT code that overrides the cache
mode requested by ioremap() and forces a WB mapping.
On a UEFI boot, however, the DMI table can live at any physical
address. On my laptop, it's around 0x77dd0000. That's nowhere near
the legacy ISA range, so the ioremap() implicit uncached type is
honored and we end up with a UC- mapping.
UC- is a very, very slow way to read from main memory, so dmi_walk()
is likely to take much longer than necessary.
Given that, even on UEFI, we do early cached DMI reads, it seems
safe to just ask for cached access. Switch to ioremap_cache().
I haven't tried to benchmark this, but I'd guess it saves several
milliseconds of boot time.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/3147c38e51f439f3c8911db34c7d4ab22d854915.1453791969.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use vector-hashing to deliver lowest-priority interrupts, As an
example, modern Intel CPUs in server platform use this method to
handle lowest-priority interrupts.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
thread_saved_pc() reads stack of a potentially running task.
This can cause false KASAN stack-out-of-bounds reports,
because the running task concurrently poisons and unpoisons
own stack.
The same happens in get_wchan(), and get get_wchan() was fixed
by using READ_ONCE_NOCHECK(). Do the same here.
Example KASAN report triggered by sysrq-t:
BUG: KASAN: out-of-bounds in sched_show_task+0x306/0x3b0 at addr ffff880043c97c18
Read of size 8 by task syz-executor/23839
[...]
page dumped because: kasan: bad access detected
[...]
Call Trace:
[<ffffffff8175ea0e>] __asan_report_load8_noabort+0x3e/0x40
[<ffffffff813e7a26>] sched_show_task+0x306/0x3b0
[<ffffffff813e7bf4>] show_state_filter+0x124/0x1a0
[<ffffffff82d2ca00>] fn_show_state+0x10/0x20
[<ffffffff82d2cf98>] k_spec+0xa8/0xe0
[<ffffffff82d3354f>] kbd_event+0xb9f/0x4000
[<ffffffff843ca8a7>] input_to_handler+0x3a7/0x4b0
[<ffffffff843d1954>] input_pass_values.part.5+0x554/0x6b0
[<ffffffff843d29bc>] input_handle_event+0x2ac/0x1070
[<ffffffff843d3a47>] input_inject_event+0x237/0x280
[<ffffffff843e8c28>] evdev_write+0x478/0x680
[<ffffffff817ac653>] __vfs_write+0x113/0x480
[<ffffffff817ae0e7>] vfs_write+0x167/0x4a0
[<ffffffff817b13d1>] SyS_write+0x111/0x220
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: glider@google.com
Cc: kasan-dev@googlegroups.com
Cc: kcc@google.com
Cc: linux-kernel@vger.kernel.org
Cc: ryabinin.a.a@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Thomas Gleixner:
"A bit on the largish side due to a series of fixes for a regression in
the x86 vector management which was introduced in 4.3. This work was
started in December already, but it took some time to fix all corner
cases and a couple of older bugs in that area which were detected
while at it
Aside of that a few platform updates for intel-mid, quark and UV and
two fixes for in the mm code:
- Use proper types for pgprot values to avoid truncation
- Prevent a size truncation in the pageattr code when setting page
attributes for large mappings"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/mm/pat: Avoid truncation when converting cpa->numpages to address
x86/mm: Fix types used in pgprot cacheability flags translations
x86/platform/quark: Print boundaries correctly
x86/platform/UV: Remove EFI memmap quirk for UV2+
x86/platform/intel-mid: Join string and fix SoC name
x86/platform/intel-mid: Enable 64-bit build
x86/irq: Plug vector cleanup race
x86/irq: Call irq_force_move_complete with irq descriptor
x86/irq: Remove outgoing CPU from vector cleanup mask
x86/irq: Remove the cpumask allocation from send_cleanup_vector()
x86/irq: Clear move_in_progress before sending cleanup IPI
x86/irq: Remove offline cpus from vector cleanup
x86/irq: Get rid of code duplication
x86/irq: Copy vectormask instead of an AND operation
x86/irq: Check vector allocation early
x86/irq: Reorganize the search in assign_irq_vector
x86/irq: Reorganize the return path in assign_irq_vector
x86/irq: Do not use apic_chip_data.old_domain as temporary buffer
x86/irq: Validate that irq descriptor is still active
x86/irq: Fix a race in x86_vector_free_irqs()
...
So the old one didn't work properly before alternatives had run.
And it was supposed to provide an optimized JMP because the
assumption was that the offset it is jumping to is within a
signed byte and thus a two-byte JMP.
So I did an x86_64 allyesconfig build and dumped all possible
sites where static_cpu_has() was used. The optimization amounted
to all in all 12(!) places where static_cpu_has() had generated
a 2-byte JMP. Which has saved us a whopping 36 bytes!
This clearly is not worth the trouble so we can remove it. The
only place where the optimization might count - in __switch_to()
- we will handle differently. But that's not subject of this
patch.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1453842730-28463-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Expose an interface to allow users to mark several accesses together as
being user space accesses, allowing batching of the surrounding user
space access markers (SMAP on x86, PAN on arm64, domain register
switching on arm).
This is currently only used for the user string lenth and copying
functions, where the SMAP overhead on x86 drowned the actual user
accesses (only noticeable on newer microarchitectures that support SMAP
in the first place, of course).
* user access batching branch:
Use the new batched user accesses in generic user string handling
Add 'unsafe' user access functions for batched accesses
x86: reorganize SMAP handling in user space accesses
Merge third patch-bomb from Andrew Morton:
"I'm pretty much done for -rc1 now:
- the rest of MM, basically
- lib/ updates
- checkpatch, epoll, hfs, fatfs, ptrace, coredump, exit
- cpu_mask simplifications
- kexec, rapidio, MAINTAINERS etc, etc.
- more dma-mapping cleanups/simplifications from hch"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (109 commits)
MAINTAINERS: add/fix git URLs for various subsystems
mm: memcontrol: add "sock" to cgroup2 memory.stat
mm: memcontrol: basic memory statistics in cgroup2 memory controller
mm: memcontrol: do not uncharge old page in page cache replacement
Documentation: cgroup: add memory.swap.{current,max} description
mm: free swap cache aggressively if memcg swap is full
mm: vmscan: do not scan anon pages if memcg swap limit is hit
swap.h: move memcg related stuff to the end of the file
mm: memcontrol: replace mem_cgroup_lruvec_online with mem_cgroup_online
mm: vmscan: pass memcg to get_scan_count()
mm: memcontrol: charge swap to cgroup2
mm: memcontrol: clean up alloc, online, offline, free functions
mm: memcontrol: flatten struct cg_proto
mm: memcontrol: rein in the CONFIG space madness
net: drop tcp_memcontrol.c
mm: memcontrol: introduce CONFIG_MEMCG_LEGACY_KMEM
mm: memcontrol: allow to disable kmem accounting for cgroup2
mm: memcontrol: account "kmem" consumers in cgroup2 memory controller
mm: memcontrol: move kmem accounting code to CONFIG_MEMCG
mm: memcontrol: separate kmem code from legacy tcp accounting code
...