Jürgen Mell reported an FPU state corruption bug under CONFIG_PREEMPT,
and bisected it to commit v2.6.19-1363-gacc2076, "i386: add sleazy FPU
optimization".
Add tsk_used_math() checks to prevent calling math_state_restore()
which can sleep in the case of !tsk_used_math(). This prevents
making a blocking call in __switch_to().
Apparently "fpu_counter > 5" check is not enough, as in some signal handling
and fork/exec scenarios, fpu_counter > 5 and !tsk_used_math() is possible.
It's a side effect though. This is the failing scenario:
process 'A' in save_i387_ia32() just after clear_used_math()
Got an interrupt and pre-empted out.
At the next context switch to process 'A' again, kernel tries to restore
the math state proactively and sees a fpu_counter > 0 and !tsk_used_math()
This results in init_fpu() during the __switch_to()'s math_state_restore()
And resulting in fpu corruption which will be saved/restored
(save_i387_fxsave and restore_i387_fxsave) during the remaining
part of the signal handling after the context switch.
Bisected-by: Jürgen Mell <j.mell@t-online.de>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Jürgen Mell <j.mell@t-online.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
iommu/gart support misses suspend/resume code, which can do bad stuff,
including memory corruption on resume. Prevent system suspend in case we
would be unable to resume.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Tested-by: Patrick <ragamuffin@datacomm.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the math emulation that got broken with the recent lazy allocation of FPU
area. init_fpu() need to be added for the math-emulation path aswell
for the FPU area allocation.
math emulation enabled kernel booted fine with this, in the presence
of "no387 nofxsr" boot param.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: hpa@zytor.com
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
this way 32-bit is more similar to 64-bit, and smarter e820 and numa.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It looks good to move bugs_64.c to cpu/bugs_64.c.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
on 64-bit we only get valid max_pfn_mapped after init_memory_mapping().
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
on 32-bit in head_32.S after initial page table is done, we get initial
max_pfn_mapped, and then kernel_physical_mapping_init will give us
a final one.
We need to use that to make sure find_e820_area will get valid addresses
for boot_map and for NODE_DATA(0) on numa32.
XEN PV and lguest may need to assign max_pfn_mapped too.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
make mptable to be consistent with acpi routing, so we could:
1. kexec kernel with acpi=off
2. work around BIOSes where acpi routing is working, but mptable is
not right, so can use kernel/kexec to start other OSes that don't have
good acpi support.
command line: update_mptable
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
we don't need to call memory_present that early.
numa and sparse will call memory_present later and might
even fail, it will call memory_present for the full range.
also for sparse it will call alloc_bootmem ... before we set up bootmem.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
iommu/gart support misses suspend/resume code, which can do bad stuff,
including memory corruption on resume. Prevent system suspend in case we
would be unable to resume.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Tested-by: Patrick <ragamuffin@datacomm.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Update the UV address macros to better describe the
fields of UV physical addresses. Improve comments
in the header files. Add additional MMR definitions.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On Wed, 2008-05-28 at 04:47 +0200, Andi Kleen wrote:
> > So... why not just remove the setting of __GFP_NORETRY? Why is it
> > wrong to oom-kill things in this case?
>
> When the 16MB zone overflows (which can be common in some workloads)
> calling the OOM killer is pretty useless because it has barely any
> real user data [only exception would be the "only 16MB" case Alan
> mentioned]. Killing random processes in this case is bad.
>
> I think for 16MB __GFP_NORETRY is ok because there should be
> nothing freeable in there so looping is useless. Only exception would be the
> "only 16MB total" case again but I'm not sure 2.6 supports that at all
> on x86.
>
> On the other hand d_a_c() does more allocations than just 16MB, especially
> on 64bit and the other zones need different strategies.
Okay, so how about this then ?
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce IRQx_VECTOR on 32-bit, so that #ifdef noise is kept
down. There should be no object code change.
[ mingo@elte.hu: merged to x86/irq not x86/i8259 due to x86/irq having
restructured the vector code into asm-x86/irq_vectors.h, which this
patch touches. ]
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch implements PCI extended configuration space access for
AMD's Barcelona CPUs. It extends the method using CF8/CFC IO
addresses. An x86 capability bit has been introduced that is set for
CPUs supporting PCI extended config space accesses.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
on two node system (16g RAM) with numa config I got this crash:
get_memcfg_from_srat: assigning address to rsdp
RSD PTR v0 [ACPIAM]
ACPI: Too big length in RSDT: 92
failed to get NUMA memory information from SRAT table
NUMA - single node, flat memory mode
Node: 0, start_pfn: 0, end_pfn: 153
Setting physnode_map array to node 0 for pfns:
0
...
Pid: 0, comm: swapper Not tainted 2.6.26-rc4 #4
[<80b41289>] hlt_loop+0x0/0x3
[<8011efa0>] ? alloc_remap+0x50/0x70
[<8079e32e>] alloc_node_mem_map+0x5e/0xa0
[<8012e77b>] ? printk+0x1b/0x20
[<80b590f6>] free_area_init_node+0xc6/0x470
[<80b588fc>] ? __alloc_bootmem_node+0x2c/0x50
[<80b58ad8>] ? find_min_pfn_for_node+0x38/0x70
[<8012e77b>] ? printk+0x1b/0x20
[<80b597c4>] free_area_init_nodes+0x254/0x2d0
[<80b544d7>] zone_sizes_init+0x97/0xa0
[<80b48a03>] setup_arch+0x383/0x530
[<8012e77b>] ? printk+0x1b/0x20
[<80b41aa4>] start_kernel+0x64/0x350
[<80b412d8>] i386_start_kernel+0x8/0x10
=======================
this patch increases the acpi table limit to 32.
Also match early_ioremap() with early_iounmap().
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
reserve early numa kva, so it will not clash with new RAMDISK
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
introduce init_pg_table_start, so xen PV could specify the value.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Create a separate centaur_64.c file in the cpu/ dir for
the useful parts to live in.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Create a separate intel_64.c file in the cpu/ dir for
the useful parts to live in.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Create a separate amd_64.c file in the cpu/ dir for
the useful parts to live in.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
arch/x86/kernel/mmconf-fam10h_64.c is missing the prototypes, which
are decalred in arch/x86/kernel/setup_64.c. Move the prototypes and
the inline stubs to the appropriate header file.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The commit
commit 4b82b27770
Author: Cyrill Gorcunov <gorcunov@gmail.com>
Date: Sat May 24 19:36:35 2008 +0400
set nmi_watchdog to NMI_IO_APIC as by default. This causes hangs on some
machines with buggy watchdogs. Fix it - i.e. restore old behaviour.
Thanks to Sitsofe Wheeler and Adrian Bunk for catching the problem
and Maciej W. Rozycki for explanation what is going on there.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
CC: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add pte_flags() to extract the flags from a pte. This is a special
case of pte_val() which is only guaranteed to return the pte's flags
correctly; the page number may be corrupted or missing.
The intent is to allow paravirt implementations to return pte flags
without having to do any translation of the page number (most notably,
Xen).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
add the boot_init_stack_canary() and make the secondary idle threads
use it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
the boot CPU's idle task has a zero stackprotector canary value.
this is a special task that is never forked, so the fork code
does not randomize its canary. Do it when we hit cpu_idle().
Academic sidenote: this means that the early init code runs with a
zero canary and hence the canary becomes predictable for this short,
boot-only amount of time.
Although attack vectors against early init code are very rare, it might
make sense to move this initialization to an earlier point.
(to one of the early init functions that never return - such as
start_kernel())
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The idle threads for non-boot CPUs are a bit special in how they
are created; the result is that these don't have the stack canary
set up properly in their PDA. Easiest fix is to just always set
the PDA up correctly when entering the idle thread; this is a NOP
for the boot cpu.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
fix a bug noticed and fixed by pageexec@freemail.hu.
if built with -fstack-protector-all then we'll have canary checks built
into the __switch_to() function. That does not work well with the
canary-switching code there: while we already use the %rsp of the
new task, we still call __switch_to() whith the previous task's canary
value in the PDA, hence the __switch_to() ssp prologue instructions
will store the previous canary. Then we update the PDA and upon return
from __switch_to() the canary check triggers and we panic.
so update the canary after we have called __switch_to(), where we are
at the same stackframe level as the last stackframe of the next
(and now freshly current) task.
Note: this means that we call __switch_to() [and its sub-functions]
still with the old canary, but that is not a problem, both the previous
and the next task has a high-quality canary. The only (mostly academic)
disadvantage is that the canary of one task may leak onto the stack of
another task, increasing the risk of information leaks, were an attacker
able to read the stack of specific tasks (but not that of others).
To solve this we'll have to reorganize the way we switch tasks, and move
the PDA setting into the switch_to() assembly code. That will happen in
another patch.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
on paravirt enabled 64-bit kernels the paravirt ops do function
calls themselves - which is bad with the stackprotector - for
example pda_init() loads 0 into %gs and then does MSR_GS_BASE
write (which modifies gs.base) - but that MSR write is a function
call on paravirt, which with stackprotector tries to read the
stack canary from the PDA ... crashing the bootup.
the solution was suggested by Arjan van de Ven: to exclude paravirt.c
from stackprotector, too many lowlevel functionality is in it. It's
not like we'll have paravirt functions with character arrays on
their stack anyway...
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Since cpu_online_map is touched (by for_each_online_cpu)
at moment when cpu_callin_map is already filled up we can
get rid of its checking at all
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: hpa@zytor.com
Cc: mingo@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
apic_write_around will be expanded to apic_write in 64bit mode
anyway. Only a few CPUs (well, old CPUs to be precise) requires
such an action. In general it should not hurt and could be cleaned
up for apic_write (just in case)
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: hpa@zytor.com
Cc: mingo@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>