The result of in_compat_syscall() can be pictured as:
x86 platform:
---------------------------------------------------
| Arch\syscall | 64-bit | ia32 | x32 |
|-------------------------------------------------|
| x86_64 | false | true | true |
|-------------------------------------------------|
| i686 | | <true> | |
---------------------------------------------------
Other platforms:
-------------------------------------------
| Arch\syscall | 64-bit | compat |
|-----------------------------------------|
| 64-bit | false | true |
|-----------------------------------------|
| 32-bit(?) | | <false> |
-------------------------------------------
As seen, the result of in_compat_syscall() on generic 32-bit platform
differs from i686.
There is no reason for in_compat_syscall() == true on native i686. It also
easy to misread code if the result on native 32-bit platform differs
between arches.
Because of that non arch-specific code has many places with:
if (IS_ENABLED(CONFIG_COMPAT) && in_compat_syscall())
in different variations.
It looks-like the only non-x86 code which uses in_compat_syscall() not
under CONFIG_COMPAT guard is in amd/amdkfd. But according to the commit
a18069c132 ("amdkfd: Disable support for 32-bit user processes"), it
actually should be disabled on native i686.
Rename in_compat_syscall() to in_32bit_syscall() for x86-specific code
and make in_compat_syscall() false under !CONFIG_COMPAT.
A follow on patch will clean up generic users which were forced to check
IS_ENABLED(CONFIG_COMPAT) with in_compat_syscall().
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-efi@vger.kernel.org
Cc: netdev@vger.kernel.org
Link: https://lkml.kernel.org/r/20181012134253.23266-2-dima@arista.com
Two users have reported [1] that they have an "extremely unlikely" system
with more than MAX_PA/2 memory and L1TF mitigation is not effective. In
fact it's a CPU with 36bits phys limit (64GB) and 32GB memory, but due to
holes in the e820 map, the main region is almost 500MB over the 32GB limit:
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000081effffff] usable
Suggestions to use 'mem=32G' to enable the L1TF mitigation while losing the
500MB revealed, that there's an off-by-one error in the check in
l1tf_select_mitigation().
l1tf_pfn_limit() returns the last usable pfn (inclusive) and the range
check in the mitigation path does not take this into account.
Instead of amending the range check, make l1tf_pfn_limit() return the first
PFN which is over the limit which is less error prone. Adjust the other
users accordingly.
[1] https://bugzilla.suse.com/show_bug.cgi?id=1105536
Fixes: 17dbca1193 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Reported-by: George Anchev <studio@anchev.net>
Reported-by: Christopher Snowhill <kode54@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180823134418.17008-1-vbabka@suse.cz
For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
table entry. This sets the high bits in the CPU's address space, thus
making sure to point to not point an unmapped entry to valid cached memory.
Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such an high mapping was mapped to unprivileged users
they could attack low memory by setting such a mapping to PROT_NONE. This
could happen through a special device driver which is not access
protected. Normal /dev/mem is of course access protected.
To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
Valid page mappings are allowed because the system is then unsafe anyways.
It's not expected that users commonly use PROT_NONE on MMIO. But to
minimize any impact this is only enforced if the mapping actually refers to
a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
the check for root.
For mmaps this is straight forward and can be handled in vm_insert_pfn and
in remap_pfn_range().
For mprotect it's a bit trickier. At the point where the actual PTEs are
accessed a lot of state has been changed and it would be difficult to undo
on an error. Since this is a uncommon case use a separate early page talk
walk pass for MMIO PROT_NONE mappings that checks for this condition
early. For non MMIO and non PROT_NONE there are no changes.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Patch series "exec: Pin stack limit during exec".
Attempts to solve problems with the stack limit changing during exec
continue to be frustrated[1][2]. In addition to the specific issues
around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
other places during exec where the stack limit is used and is assumed to
be unchanging. Given the many places it gets used and the fact that it
can be manipulated/raced via setrlimit() and prlimit(), I think the only
way to handle this is to move away from the "current" view of the stack
limit and instead attach it to the bprm, and plumb this down into the
functions that need to know the stack limits. This series implements
the approach.
[1] 04e35f4495 ("exec: avoid RLIMIT_STACK races with prlimit()")
[2] 779f4e1c6c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
[3] to security@kernel.org, "Subject: existing rlimit races?"
This patch (of 3):
Since it is possible that the stack rlimit can change externally during
exec (either via another thread calling setrlimit() or another process
calling prlimit()), provide a way to pass the rlimit down into the
per-architecture mm layout functions so that the rlimit can stay in the
bprm structure instead of sitting in the signal structure until exec is
finalized.
Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One thing /dev/mem access APIs should verify is that there's no way
that excessively large pfn's can leak into the high bits of the
page table entry.
In particular, if people can use "very large physical page addresses"
through /dev/mem to set the bits past bit 58 - SOFTW4 and permission
key bits and NX bit, that could *really* confuse the kernel.
We had an earlier attempt:
ce56a86e2a ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses")
... which turned out to be too restrictive (breaking mem=... bootups for example) and
had to be reverted in:
90edaac627 ("Revert "x86/mm: Limit mmap() of /dev/mem to valid physical addresses"")
This v2 attempt modifies the original patch and makes sure that mmap(/dev/mem)
limits the pfns so that it at least fits in the actual pteval_t architecturally:
- Make sure mmap_mem() actually validates that the offset fits in phys_addr_t
( This may be indirectly true due to some other check, but it's not
entirely obvious. )
- Change valid_mmap_phys_addr_range() to just use phys_addr_valid()
on the top byte
( Top byte is sufficient, because mmap_mem() has already checked that
it cannot wrap. )
- Add a few comments about what the valid_phys_addr_range() vs.
valid_mmap_phys_addr_range() difference is.
Signed-off-by: Craig Bergstrom <craigb@google.com>
[ Fixed the checks and added comments. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ Collected the discussion and patches into a commit. ]
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Sean Young <sean@mess.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/CA+55aFyEcOMb657vWSmrM13OxmHxC-XxeBmNis=DwVvpJUOogQ@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In case of 5-level paging, the kernel does not place any mapping above
47-bit, unless userspace explicitly asks for it.
Userspace can request an allocation from the full address space by
specifying the mmap address hint above 47-bit.
Nicholas noticed that the current implementation violates this interface:
If user space requests a mapping at the end of the 47-bit address space
with a length which causes the mapping to cross the 47-bit border
(DEFAULT_MAP_WINDOW), then the vma is partially in the address space
below and above.
Sanity check the mmap address hint so that start and end of the resulting
vma are on the same side of the 47-bit border. If that's not the case fall
back to the code path which ignores the address hint and allocate from the
regular address space below 47-bit.
To make the checks consistent, mask out the address hints lower bits
(either PAGE_MASK or huge_page_mask()) instead of using ALIGN() which can
push them up to the next boundary.
[ tglx: Moved the address check to a function and massaged comment and
changelog ]
Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20171115143607.81541-1-kirill.shutemov@linux.intel.com
On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.
To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.
But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.
If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.
A high hint address would only affect the allocation in question, but not
any future mmap()s.
Specifying high hint address on older kernel or on machine without 5-level
paging support is safe. The hint will be ignored and kernel will fall back
to allocation from 47-bit address space.
This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.
The patch puts all machinery in place, but not yet allows userspace to have
mappings above 47-bit -- TASK_SIZE_MAX has to be raised to get the effect.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170716225954.74185-7-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the following commit in 2008:
cc503c1b43 ("x86: PIE executable randomization")
We added a heuristics to treat applications with RLIMIT_STACK configured
to unlimited as legacy. This means:
a) set the mmap_base to 1/3 of address space + randomization and
b) mmap from bottom to top.
This makes some sense as it allows the stack to grow really large. On the
other hand it reduces the address space usable for default mmaps
(without address hint) quite a lot.
We have received a bug report that SAP HANA workload has hit into this
limitation.
We could argue that the user just got what he asked for when setting
up the unlimited stack but to be realistic growing stack up to 1/6
TASK_SIZE (allowed by mmap_base) is pretty much unimited in the real
life. This would give mmap 20TB of additional address space which is
quite nice. Especially when it is much more likely to use that address
space than the reserved stack.
Digging into the history the original implementation of the randomization:
8817210d4d ("[PATCH] x86_64: Flexmap for 32bit and randomized mappings for 64bit")
didn't have this restriction.
So let's try and remove this assumption - hopefully nothing breaks.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: hughd@google.com
Cc: linux-mm@kvack.org
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/tip-86b110d2ae6365ce91cabd37588bc8611770421a@git.kernel.org
[ So I've applied this to tip:x86/mm with a wider Cc: list - if anyone objects to this change please holler. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 1b028f784e introduced two mmap() bases for 32-bit syscalls and for
64-bit syscalls. The mmap() code in x86 was modified to handle the
separation, but the patch series missed to update the hugetlb code.
As a consequence a 32bit application mapping a file on hugetlbfs uses the
64-bit mmap base for address space allocation, which fails.
Adjust the hugetlb mapping code to use the proper bases depending on the
syscall invocation mode (64-bit or compat).
[ tglx: Massaged changelog and switched from asm/compat.h to linux/compat.h ]
Fixes: commit 1b028f784e ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: 0x7f454c46@gmail.com
Cc: linux-mm@kvack.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20170314114126.9280-1-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
mmap() uses a base address, from which it starts to look for a free space
for allocation.
The base address is stored in mm->mmap_base, which is calculated during
exec(). The address depends on task's size, set rlimit for stack, ASLR
randomization. The base depends on the task size and the number of random
bits which are different for 64-bit and 32bit applications.
Due to the fact, that the base address is fixed, its mmap() from a compat
(32bit) syscall issued by a 64bit task will return a address which is based
on the 64bit base address and does not fit into the 32bit address space
(4GB). The returned pointer is truncated to 32bit, which results in an
invalid address.
To solve store a seperate compat address base plus a compat legacy address
base in mm_struct. These bases are calculated at exec() time and can be
used later to address the 32bit compat mmap() issued by 64 bit
applications.
As a consequence of this change 32-bit applications issuing a 64-bit
syscall (after doing a long jump) will get a 64-bit mapping now. Before
this change 32-bit applications always got a 32bit mapping.
[ tglx: Massaged changelog and added a comment ]
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: 0x7f454c46@gmail.com
Cc: linux-mm@kvack.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20170306141721.9188-4-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We are going to split more MM APIs out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.
The APIs that we are going to move are:
arch_pick_mmap_layout()
arch_get_unmapped_area()
arch_get_unmapped_area_topdown()
mm_update_next_owner()
Include the header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 mm updates from Ingo Molnar:
"The main changes in this cycle were:
- Enable full ASLR randomization for 32-bit programs (Hector
Marco-Gisbert)
- Add initial minimal INVPCI support, to flush global mappings (Andy
Lutomirski)
- Add KASAN enhancements (Andrey Ryabinin)
- Fix mmiotrace for huge pages (Karol Herbst)
- ... misc cleanups and small enhancements"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/32: Enable full randomization on i386 and X86_32
x86/mm/kmmio: Fix mmiotrace for hugepages
x86/mm: Avoid premature success when changing page attributes
x86/mm/ptdump: Remove paravirt_enabled()
x86/mm: Fix INVPCID asm constraint
x86/dmi: Switch dmi_remap() from ioremap() [uncached] to ioremap_cache()
x86/mm: If INVPCID is available, use it to flush global mappings
x86/mm: Add a 'noinvpcid' boot option to turn off INVPCID
x86/mm: Add INVPCID helpers
x86/kasan: Write protect kasan zero shadow
x86/kasan: Clear kasan_zero_page after TLB flush
x86/mm/numa: Check for failures in numa_clear_kernel_node_hotplug()
x86/mm/numa: Clean up numa_clear_kernel_node_hotplug()
x86/mm: Make kmap_prot into a #define
x86/mm/32: Set NX in __supported_pte_mask before enabling paging
x86/mm: Streamline and restore probe_memory_block_size()
Currently on i386 and on X86_64 when emulating X86_32 in legacy mode, only
the stack and the executable are randomized but not other mmapped files
(libraries, vDSO, etc.). This patch enables randomization for the
libraries, vDSO and mmap requests on i386 and in X86_32 in legacy mode.
By default on i386 there are 8 bits for the randomization of the libraries,
vDSO and mmaps which only uses 1MB of VA.
This patch preserves the original randomness, using 1MB of VA out of 3GB or
4GB. We think that 1MB out of 3GB is not a big cost for having the ASLR.
The first obvious security benefit is that all objects are randomized (not
only the stack and the executable) in legacy mode which highly increases
the ASLR effectiveness, otherwise the attackers may use these
non-randomized areas. But also sensitive setuid/setgid applications are
more secure because currently, attackers can disable the randomization of
these applications by setting the ulimit stack to "unlimited". This is a
very old and widely known trick to disable the ASLR in i386 which has been
allowed for too long.
Another trick used to disable the ASLR was to set the ADDR_NO_RANDOMIZE
personality flag, but fortunately this doesn't work on setuid/setgid
applications because there is security checks which clear Security-relevant
flags.
This patch always randomizes the mmap_legacy_base address, removing the
possibility to disable the ASLR by setting the stack to "unlimited".
Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es>
Acked-by: Ismael Ripoll Ripoll <iripoll@upv.es>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/1457639460-5242-1-git-send-email-hecmargi@upv.es
Signed-off-by: Ingo Molnar <mingo@kernel.org>
MPX setups private anonymous mapping, but uses vma->vm_ops too.
This can confuse core VM, as it relies on vm->vm_ops to
distinguish file VMAs from anonymous.
As result we will get SIGBUS, because handle_pte_fault() thinks
it's file VMA without vm_ops->fault and it doesn't know how to
handle the situation properly.
Let's fix that by not setting ->vm_ops.
We don't really need ->vm_ops here: MPX VMA can be detected with
VM_MPX flag. And vma_merge() will not merge MPX VMA with non-MPX
VMA, because ->vm_flags won't match.
The only thing left is name of VMA. I'm not sure if it's part of
ABI, or we can just drop it. The patch keep it by providing
arch_vma_name() on x86.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org> # Fixes: 6b7339f4 (mm: avoid setting up anonymous pages into file mapping)
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@sr71.net
Link: http://lkml.kernel.org/r/20150720212958.305CC3E9@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The issue is that the stack for processes is not properly randomized on
64 bit architectures due to an integer overflow.
The affected function is randomize_stack_top() in file
"fs/binfmt_elf.c":
static unsigned long randomize_stack_top(unsigned long stack_top)
{
unsigned int random_variable = 0;
if ((current->flags & PF_RANDOMIZE) &&
!(current->personality & ADDR_NO_RANDOMIZE)) {
random_variable = get_random_int() & STACK_RND_MASK;
random_variable <<= PAGE_SHIFT;
}
return PAGE_ALIGN(stack_top) + random_variable;
return PAGE_ALIGN(stack_top) - random_variable;
}
Note that, it declares the "random_variable" variable as "unsigned int".
Since the result of the shifting operation between STACK_RND_MASK (which
is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):
random_variable <<= PAGE_SHIFT;
then the two leftmost bits are dropped when storing the result in the
"random_variable". This variable shall be at least 34 bits long to hold
the (22+12) result.
These two dropped bits have an impact on the entropy of process stack.
Concretely, the total stack entropy is reduced by four: from 2^28 to
2^30 (One fourth of expected entropy).
This patch restores back the entropy by correcting the types involved
in the operations in the functions randomize_stack_top() and
stack_maxrandom_size().
The successful fix can be tested with:
$ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0 [stack]
7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0 [stack]
7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0 [stack]
7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0 [stack]
...
Once corrected, the leading bytes should be between 7ffc and 7fff,
rather than always being 7fff.
Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es>
Signed-off-by: Ismael Ripoll <iripoll@upv.es>
[ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: CVE-2015-1593
Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net
Signed-off-by: Borislav Petkov <bp@suse.de>
This reverts commit df54d6fa54.
The commit isn't necessarily wrong, but because it recalculates the
random mmap_base every time, it seems to confuse user memory allocators
that expect contiguous mmap allocations even when the mmap address isn't
specified.
In particular, the MATLAB Java runtime seems to be unhappy. See
https://bugzilla.kernel.org/show_bug.cgi?id=60774
So we'll want to apply the random offset only once, and Radu has a patch
for that. Revert this older commit in order to apply the other one.
Reported-by: Jeff Shorey <shoreyjeff@gmail.com>
Cc: Radu Caragea <sinaelgl@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hpa reported that dfb09f9b7a breaks 32-bit
builds with the following error message:
/home/hpa/kernel/linux-tip.cpu/arch/x86/kernel/cpu/amd.c:437: undefined
reference to `va_align'
/home/hpa/kernel/linux-tip.cpu/arch/x86/kernel/cpu/amd.c:436: undefined
reference to `va_align'
This is due to the fact that va_align is a global in a 64-bit only
compilation unit. Move it to mmap.c where it is visible to both
subarches.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1312633899-1131-1-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This patch provides performance tuning for the "Bulldozer" CPU. With its
shared instruction cache there is a chance of generating an excessive
number of cache cross-invalidates when running specific workloads on the
cores of a compute module.
This excessive amount of cross-invalidations can be observed if cache
lines backed by shared physical memory alias in bits [14:12] of their
virtual addresses, as those bits are used for the index generation.
This patch addresses the issue by clearing all the bits in the [14:12]
slice of the file mapping's virtual address at generation time, thus
forcing those bits the same for all mappings of a single shared library
across processes and, in doing so, avoids instruction cache aliases.
It also adds the command line option "align_va_addr=(32|64|on|off)" with
which virtual address alignment can be enabled for 32-bit or 64-bit x86
individually, or both, or be completely disabled.
This change leaves virtual region address allocation on other families
and/or vendors unaffected.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1312550110-24160-2-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Make sure compiler won't do weird things with limits. Fetching them
twice may return 2 different values after writable limits are
implemented.
We can either use rlimit helpers added in
3e10e716ab or ACCESS_ONCE if not
applicable; this patch uses the helpers.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
LKML-Reference: <1264609942-24621-1-git-send-email-jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Currently we are not including randomized stack size when calculating
mmap_base address in arch_pick_mmap_layout for topdown case. This might
cause that mmap_base starts in the stack reserved area because stack is
randomized by 1GB for 64b (8MB for 32b) and the minimum gap is 128MB.
If the stack really grows down to mmap_base then we can get silent mmap
region overwrite by the stack values.
Let's include maximum stack randomization size into MIN_GAP which is
used as the low bound for the gap in mmap.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
LKML-Reference: <1252400515-6866-1-git-send-email-mhocko@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Stable Team <stable@kernel.org>
mmap_is_ia32 always true for X86_32, or while emulating IA32 on X86_64
Randomization not supported on X86_32 in legacy layout. Both layouts allow
randomization on X86_64.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>