Currently, create_hyp_mappings applies a "one size fits all" page
protection (PAGE_HYP). As we're heading towards separate protections
for different sections, let's make this protection a parameter, and
let the callers pass their prefered protection (PAGE_HYP for everyone
for the time being).
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Add three new files, kexec.h, machine_kexec.c and relocate_kernel.S to the
arm64 architecture that add support for the kexec re-boot mechanism
(CONFIG_KEXEC) on arm64 platforms.
Signed-off-by: Geoff Levand <geoff@infradead.org>
Reviewed-by: James Morse <james.morse@arm.com>
[catalin.marinas@arm.com: removed dead code following James Morse's comments]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Commit 68234df4ea ("arm64: kill flush_cache_all()") removed the global
arm64 routines cpu_reset() and cpu_soft_restart() needed by the arm64
kexec and kdump support. Add back a simplified version of
cpu_soft_restart() with some changes needed for kexec in the new files
cpu_reset.S, and cpu_reset.h.
When a CPU is reset it needs to be put into the exception level it had when
it entered the kernel. Update cpu_soft_restart() to accept an argument
which signals if the reset address should be entered at EL1 or EL2, and
add a new hypercall HVC_SOFT_RESTART which is used for the EL2 switch.
Signed-off-by: Geoff Levand <geoff@infradead.org>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
kernel/smp.c has a fancy counter that keeps track of the number of CPUs
it marked as not-present and left in cpu_park_loop(). If there are any
CPUs spinning in here, features like kexec or hibernate may release them
by overwriting this memory.
This problem also occurs on machines using spin-tables to release
secondary cores.
After commit 44dbcc93ab ("arm64: Fix behavior of maxcpus=N")
we bring all known cpus into the secondary holding pen, meaning this
memory can't be re-used by kexec or hibernate.
Add a function cpus_are_stuck_in_kernel() to determine if either of these
cases have occurred.
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
[catalin.marinas@arm.com: cherry-picked from mainline for kexec dependency]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This commit makes a few slight modifications to the efi_call_virt() macro
to get it to work with function pointers that are stored in locations
other than efi.systab->runtime, and renames the macro to
efi_call_virt_pointer(). The majority of the changes here are to pull
these macros up into header files so that they can be accessed from
outside of drivers/firmware/efi/runtime-wrappers.c.
The most significant change not directly related to the code move is to
add an extra "p" argument into the appropriate efi_call macros, and use
that new argument in place of the, formerly hard-coded,
efi.systab->runtime pointer.
The last piece of the puzzle was to add an efi_call_virt() macro back into
drivers/firmware/efi/runtime-wrappers.c to wrap around the new
efi_call_virt_pointer() macro - this was mainly to keep the code from
looking too cluttered by adding a bunch of extra references to
efi.systab->runtime everywhere.
Note that I also broke up the code in the efi_call_virt_pointer() macro a
bit in the process of moving it.
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roy Franz <roy.franz@linaro.org>
Cc: Russ Anderson <rja@sgi.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/1466839230-12781-5-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge misc fixes from Andrew Morton:
"Two weeks worth of fixes here"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
autofs: don't get stuck in a loop if vfs_write() returns an error
mm/page_owner: avoid null pointer dereference
tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
fs/nilfs2: fix potential underflow in call to crc32_le
oom, suspend: fix oom_reaper vs. oom_killer_disable race
ocfs2: disable BUG assertions in reading blocks
mm, compaction: abort free scanner if split fails
mm: prevent KASAN false positives in kmemleak
mm/hugetlb: clear compound_mapcount when freeing gigantic pages
mm/swap.c: flush lru pvecs on compound page arrival
memcg: css_alloc should return an ERR_PTR value on error
memcg: mem_cgroup_migrate() may be called with irq disabled
hugetlb: fix nr_pmds accounting with shared page tables
Revert "mm: disable fault around on emulated access bit architecture"
Revert "mm: make faultaround produce old ptes"
mailmap: add Boris Brezillon's email
mailmap: add Antoine Tenart's email
mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
mm: mempool: kasan: don't poot mempool objects in quarantine
...
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.
{pte,pmd,pud}_alloc_one{_kernel}, late_pgtable_alloc use PGALLOC_GFP for
__get_free_page (aka order-0).
pgd_alloc is slightly more complex because it allocates from pgd_cache
if PGD_SIZE != PAGE_SIZE and PGD_SIZE depends on the configuration
(CONFIG_ARM64_VA_BITS, PAGE_SHIFT and CONFIG_PGTABLE_LEVELS).
As per
config PGTABLE_LEVELS
int
default 2 if ARM64_16K_PAGES && ARM64_VA_BITS_36
default 2 if ARM64_64K_PAGES && ARM64_VA_BITS_42
default 3 if ARM64_64K_PAGES && ARM64_VA_BITS_48
default 3 if ARM64_4K_PAGES && ARM64_VA_BITS_39
default 3 if ARM64_16K_PAGES && ARM64_VA_BITS_47
default 4 if !ARM64_64K_PAGES && ARM64_VA_BITS_48
we should have the following options
CONFIG_ARM64_VA_BITS:48 CONFIG_PGTABLE_LEVELS:4 PAGE_SIZE:4k size:4096 pages:1
CONFIG_ARM64_VA_BITS:48 CONFIG_PGTABLE_LEVELS:4 PAGE_SIZE:16k size:16 pages:1
CONFIG_ARM64_VA_BITS:48 CONFIG_PGTABLE_LEVELS:3 PAGE_SIZE:64k size:512 pages:1
CONFIG_ARM64_VA_BITS:47 CONFIG_PGTABLE_LEVELS:3 PAGE_SIZE:16k size:16384 pages:1
CONFIG_ARM64_VA_BITS:42 CONFIG_PGTABLE_LEVELS:2 PAGE_SIZE:64k size:65536 pages:1
CONFIG_ARM64_VA_BITS:39 CONFIG_PGTABLE_LEVELS:3 PAGE_SIZE:4k size:4096 pages:1
CONFIG_ARM64_VA_BITS:36 CONFIG_PGTABLE_LEVELS:2 PAGE_SIZE:16k size:16384 pages:1
All of them fit into a single page (aka order-0). This means that this
flag has never been actually useful here because it has always been used
only for PAGE_ALLOC_COSTLY requests.
Link: http://lkml.kernel.org/r/1464599699-30131-6-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/smp.c has a fancy counter that keeps track of the number of CPUs
it marked as not-present and left in cpu_park_loop(). If there are any
CPUs spinning in here, features like kexec or hibernate may release them
by overwriting this memory.
This problem also occurs on machines using spin-tables to release
secondary cores.
After commit 44dbcc93ab ("arm64: Fix behavior of maxcpus=N")
we bring all known cpus into the secondary holding pen, meaning this
memory can't be re-used by kexec or hibernate.
Add a function cpus_are_stuck_in_kernel() to determine if either of these
cases have occurred.
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
This patch adds support for ACPI_TABLE_UPGRADE for ARM64
To access initrd image we need to move initialization
of linear mapping a bit earlier.
The implementation of the feature acpi_table_upgrade()
(drivers/acpi/tables.c) works with initrd data represented as an array
in virtual memory. It uses some library utility to find the redefined
tables in that array and iterates over it to copy the data to new
allocated memory. So to access the initrd data via fixmap
we need to rewrite it considerably.
In x86 arch, kernel memory is already mapped by the time when
acpi_table_upgrade() and acpi_boot_table_init() are called so I
think that we can just move this mapping one function earlier too.
Signed-off-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Several places open-code extraction of the EC field from an ESR_ELx
value, in subtly different ways. This is unfortunate duplication and
variation, and the precise logic used to extract the field is a
distraction.
This patch adds a new macro, ESR_ELx_EC(), to extract the EC field from
an ESR_ELx value in a consistent fashion.
Existing open-coded extractions in core arm64 code are moved over to the
new helper. KVM code is left as-is for the moment.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Huang Shijie <shijie.huang@arm.com>
Cc: Dave P Martin <dave.martin@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The upstream commit 1771c6e1a5
("x86/kasan: instrument user memory access API") added KASAN instrument to
x86 user memory access API, so added such instrument to ARM64 too.
Define __copy_to/from_user in C in order to add kasan_check_read/write call,
rename assembly implementation to __arch_copy_to/from_user.
Tested by test_kasan module.
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
For debugging purposes, it would be nice if we could export page tables
other than the swapper_pg_dir to userspace. To enable this, this patch
refactors the arm64 page table dumping code such that multiple tables
may be registered with the framework, and exported under debugfs.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Laura Abbott <labbott@fedoraproject.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
AArch64 is capable of 128-bit memory accesses without alignment
restrictions, which makes it both possible and highly practical to slurp
up a typical 20-byte IP header in just 2 loads. Implement our own
version of ip_fast_checksum() to take advantage of that, resulting in
considerably fewer instructions and memory accesses than the generic
version. We can also get more optimal code generation for csum_fold() by
defining it a slightly different way round from the generic version, so
throw that into the mix too.
Suggested-by: Luke Starrett <luke.starrett@broadcom.com>
Acked-by: Luke Starrett <luke.starrett@broadcom.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Current versions of gdb do not interoperate cleanly with kgdb on arm64
systems because gdb and kgdb do not use the same register description.
This patch modifies kgdb to work with recent releases of gdb (>= 7.8.1).
Compatibility with gdb (after the patch is applied) is as follows:
gdb-7.6 and earlier Ok
gdb-7.7 series Works if user provides custom target description
gdb-7.8(.0) Works if user provides custom target description
gdb-7.8.1 and later Ok
When commit 44679a4f14 ("arm64: KGDB: Add step debugging support") was
introduced it was paired with a gdb patch that made an incompatible
change to the gdbserver protocol. This patch was eventually merged into
the gdb sources:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=a4d9ba85ec5597a6a556afe26b712e878374b9dd
The change to the protocol was mostly made to simplify big-endian support
inside the kernel gdb stub. Unfortunately the gdb project released
gdb-7.7.x and gdb-7.8.0 before the protocol incompatibility was identified
and reversed:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=bdc144174bcb11e808b4e73089b850cf9620a7ee
This leaves us in a position where kgdb still uses the no-longer-used
protocol; gdb-7.8.1, which restored the original behaviour, was
released on 2014-10-29.
I don't believe it is possible to detect/correct the protocol
incompatiblity which means the kernel must take a view about which
version of the gdb remote protocol is "correct". This patch takes the
view that the original/current version of the protocol is correct
and that version found in gdb-7.7.x and gdb-7.8.0 is anomalous.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Rather than wait until we observe the lock being free (which might never
happen), we can also return from spin_unlock_wait if we observe that the
lock is now held by somebody else, which implies that it was unlocked
but we just missed seeing it in that state.
Furthermore, in such a scenario there is no longer a need to write back
the value that we loaded, since we know that there has been a lock
hand-off, which is sufficient to publish any stores prior to the
unlock_wait because the ARm architecture ensures that a Store-Release
instruction is multi-copy atomic when observed by a Load-Acquire
instruction.
The litmus test is something like:
AArch64
{
0:X1=x; 0:X3=y;
1:X1=y;
2:X1=y; 2:X3=x;
}
P0 | P1 | P2 ;
MOV W0,#1 | MOV W0,#1 | LDAR W0,[X1] ;
STR W0,[X1] | STLR W0,[X1] | LDR W2,[X3] ;
DMB SY | | ;
LDR W2,[X3] | | ;
exists
(0:X2=0 /\ 2:X0=1 /\ 2:X2=0)
where P0 is doing spin_unlock_wait, P1 is doing spin_unlock and P2 is
doing spin_lock.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Commit d86b8da04d ("arm64: spinlock: serialise spin_unlock_wait against
concurrent lockers") fixed spin_unlock_wait for LL/SC-based atomics under
the premise that the LSE atomics (in particular, the LDADDA instruction)
are indivisible.
Unfortunately, these instructions are only indivisible when used with the
-AL (full ordering) suffix and, consequently, the same issue can
theoretically be observed with LSE atomics, where a later (in program
order) load can be speculated before the write portion of the atomic
operation.
This patch fixes the issue by performing a CAS of the lock once we've
established that it's unlocked, in much the same way as the LL/SC code.
Fixes: d86b8da04d ("arm64: spinlock: serialise spin_unlock_wait against concurrent lockers")
Signed-off-by: Will Deacon <will.deacon@arm.com>
spin_is_locked has grown two very different use-cases:
(1) [The sane case] API functions may require a certain lock to be held
by the caller and can therefore use spin_is_locked as part of an
assert statement in order to verify that the lock is indeed held.
For example, usage of assert_spin_locked.
(2) [The insane case] There are two locks, where a CPU takes one of the
locks and then checks whether or not the other one is held before
accessing some shared state. For example, the "optimized locking" in
ipc/sem.c.
In the latter case, the sequence looks like:
spin_lock(&sem->lock);
if (!spin_is_locked(&sma->sem_perm.lock))
/* Access shared state */
and requires that the spin_is_locked check is ordered after taking the
sem->lock. Unfortunately, since our spinlocks are implemented using a
LDAXR/STXR sequence, the read of &sma->sem_perm.lock can be speculated
before the STXR and consequently return a stale value.
Whilst this hasn't been seen to cause issues in practice, PowerPC fixed
the same issue in 51d7d5205d ("powerpc: Add smp_mb() to
arch_spin_is_locked()") and, although we did something similar for
spin_unlock_wait in d86b8da04d ("arm64: spinlock: serialise
spin_unlock_wait against concurrent lockers") that doesn't actually take
care of ordering against local acquisition of a different lock.
This patch adds an smp_mb() to the start of our arch_spin_is_locked and
arch_spin_unlock_wait routines to ensure that the lock value is always
loaded after any other locks have been taken by the current CPU.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
On arm64, all SoCs we supported so far either have an IOMMU or have bus
addresses equal to CPU addresses.
However, with the Raspberry Pi 3 coming up, this is no longer true. To
allow DMA to work with an AArch64 kernel on those devices, let's allow
devices to have DMA offsets again.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Eric Anholt <eric@anholt.net>
In some cases (e.g. the awk for CONFIG_RANDOMIZE_TEXT_OFFSET) we would
like to make use of PAGE_SHIFT outside of code that can include the
usual header files.
Add a new CONFIG_ARM64_PAGE_SHIFT for this, likewise with
ARM64_CONT_SHIFT for consistency.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Commit ab893fb9f1 ("arm64: introduce KIMAGE_VADDR as the virtual
base of the kernel region") logically split KIMAGE_VADDR from
PAGE_OFFSET, and since commit f9040773b7 ("arm64: move kernel
image to base of vmalloc area") the two have been distinct values.
Unfortunately, neither commit updated the comment above these
definitions, which now erroneously states that PAGE_OFFSET is the start
of the kernel image rather than the start of the linear mapping.
This patch fixes said comment, and introduces an explanation of
KIMAGE_VADDR.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
We're missing entries for mlock2, copy_file_range, preadv2 and pwritev2
in our compat syscall table, so hook them up. Only the last two need
compat wrappers.
Signed-off-by: Will Deacon <will.deacon@arm.com>
This patch brings the PER_LINUX32 /proc/cpuinfo format more in line with
the 32-bit ARM one by providing an additional line:
model name : ARMv8 Processor rev X (v8l)
Cc: <stable@vger.kernel.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Since commit 12a0ef7b0a ("arm64: use generic strnlen_user and
strncpy_from_user functions"), the definition of __addr_ok() has been
languishing unused; eradicate the sucker.
CC: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Introduce a new file to hold ACPI based NUMA information parsing from
SRAT and SLIT.
SRAT includes the CPU ACPI ID to Proximity Domain mappings and memory
ranges to Proximity Domain mapping. SLIT has the information of inter
node distances(relative number for access latency).
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
[rrichter@cavium.com Reworked for numa v10 series ]
Signed-off-by: Robert Richter <rrichter@cavium.com>
[david.daney@cavium.com reorderd and combinded with other patches in
Hanjun Guo's original set, removed get_mpidr_in_madt() and use
acpi_map_madt_entry() instead.]
Signed-off-by: David Daney <david.daney@cavium.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Dennis Chen <dennis.chen@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull second batch of KVM updates from Radim Krčmář:
"General:
- move kvm_stat tool from QEMU repo into tools/kvm/kvm_stat (kvm_stat
had nothing to do with QEMU in the first place -- the tool only
interprets debugfs)
- expose per-vm statistics in debugfs and support them in kvm_stat
(KVM always collected per-vm statistics, but they were summarised
into global statistics)
x86:
- fix dynamic APICv (VMX was improperly configured and a guest could
access host's APIC MSRs, CVE-2016-4440)
- minor fixes
ARM changes from Christoffer Dall:
- new vgic reimplementation of our horribly broken legacy vgic
implementation. The two implementations will live side-by-side
(with the new being the configured default) for one kernel release
and then we'll remove the legacy one.
- fix for a non-critical issue with virtual abort injection to guests"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits)
tools: kvm_stat: Add comments
tools: kvm_stat: Introduce pid monitoring
KVM: Create debugfs dir and stat files for each VM
MAINTAINERS: Add kvm tools
tools: kvm_stat: Powerpc related fixes
tools: Add kvm_stat man page
tools: Add kvm_stat vm monitor script
kvm:vmx: more complete state update on APICv on/off
KVM: SVM: Add more SVM_EXIT_REASONS
KVM: Unify traced vector format
svm: bitwise vs logical op typo
KVM: arm/arm64: vgic-new: Synchronize changes to active state
KVM: arm/arm64: vgic-new: enable build
KVM: arm/arm64: vgic-new: implement mapped IRQ handling
KVM: arm/arm64: vgic-new: Wire up irqfd injection
KVM: arm/arm64: vgic-new: Add vgic_v2/v3_enable
KVM: arm/arm64: vgic-new: vgic_init: implement map_resources
KVM: arm/arm64: vgic-new: vgic_init: implement vgic_init
KVM: arm/arm64: vgic-new: vgic_init: implement vgic_create
KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_init
...
KVM/ARM Changes for v4.7 take 2
"The GIC is dead; Long live the GIC"
This set of changes include the new vgic, which is a reimplementation of
our horribly broken legacy vgic implementation. The two implementations
will live side-by-side (with the new being the configured default) for
one kernel release and then we'll remove it.
Also fixes a non-critical issue with virtual abort injection to guests.
When modifying the active state of an interrupt via the MMIO interface,
we should ensure that the write has the intended effect.
If a guest sets an interrupt to active, but that interrupt is already
flushed into a list register on a running VCPU, then that VCPU will
write the active state back into the struct vgic_irq upon returning from
the guest and syncing its state. This is a non-benign race, because the
guest can observe that an interrupt is not active, and it can have a
reasonable expectations that other VCPUs will not ack any IRQs, and then
set the state to active, and expect it to stay that way. Currently we
are not honoring this case.
Thefore, change both the SACTIVE and CACTIVE mmio handlers to stop the
world, change the irq state, potentially queue the irq if we're setting
it to active, and then continue.
We take this chance to slightly optimize these functions by not stopping
the world when touching private interrupts where there is inherently no
possible race.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
For some rare corner cases in our VGIC emulation later we have to stop
the guest to make sure the VGIC state is consistent.
Provide the necessary framework to pause and resume a guest.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Merge updates from Andrew Morton:
- fsnotify fix
- poll() timeout fix
- a few scripts/ tweaks
- debugobjects updates
- the (small) ocfs2 queue
- Minor fixes to kernel/padata.c
- Maybe half of the MM queue
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
mm, page_alloc: restore the original nodemask if the fast path allocation failed
mm, page_alloc: uninline the bad page part of check_new_page()
mm, page_alloc: don't duplicate code in free_pcp_prepare
mm, page_alloc: defer debugging checks of pages allocated from the PCP
mm, page_alloc: defer debugging checks of freed pages until a PCP drain
cpuset: use static key better and convert to new API
mm, page_alloc: inline pageblock lookup in page free fast paths
mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
mm, page_alloc: pull out side effects from free_pages_check
mm, page_alloc: un-inline the bad part of free_pages_check
mm, page_alloc: check multiple page fields with a single branch
mm, page_alloc: remove field from alloc_context
mm, page_alloc: avoid looking up the first zone in a zonelist twice
mm, page_alloc: shortcut watermark checks for order-0 pages
mm, page_alloc: reduce cost of fair zone allocation policy retry
mm, page_alloc: shorten the page allocator fast path
mm, page_alloc: check once if a zone has isolated pageblocks
mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
mm, page_alloc: simplify last cpupid reset
mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
...
I've just discovered that the useful-sounding has_transparent_hugepage()
is actually an architecture-dependent minefield: on some arches it only
builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
not, but on some of those (arm and arm64) it then gives the wrong
answer; and on mips alone it's marked __init, which would crash if
called later (but so far it has not been called later).
Straighten this out: make it available to all configs, with a sensible
default in asm-generic/pgtable.h, removing its definitions from those
arches (arc, arm, arm64, sparc, tile) which are served by the default,
adding #define has_transparent_hugepage has_transparent_hugepage to
those (mips, powerpc, s390, x86) which need to override the default at
runtime, and removing the __init from mips (but maybe that kind of code
should be avoided after init: set a static variable the first time it's
called).
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [arch/s390]
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull IOMMU updates from Joerg Roedel:
"The updates include:
- rate limiting for the VT-d fault handler
- remove statistics code from the AMD IOMMU driver. It is unused and
should be replaced by something more generic if needed
- per-domain pagesize-bitmaps in IOMMU core code to support systems
with different types of IOMMUs
- support for ACPI devices in the AMD IOMMU driver
- 4GB mode support for Mediatek IOMMU driver
- ARM-SMMU updates from Will Deacon:
- support for 64k pages with SMMUv1 implementations (e.g MMU-401)
- remove open-coded 64-bit MMIO accessors
- initial support for 16-bit VMIDs, as supported by some ThunderX
SMMU implementations
- a couple of errata workarounds for silicon in the field
- various fixes here and there"
* tag 'iommu-updates-v4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (44 commits)
iommu/arm-smmu: Use per-domain page sizes.
iommu/amd: Remove statistics code
iommu/dma: Finish optimising higher-order allocations
iommu: Allow selecting page sizes per domain
iommu: of: enforce const-ness of struct iommu_ops
iommu: remove unused priv field from struct iommu_ops
iommu/dma: Implement scatterlist segment merging
iommu/arm-smmu: Clear cache lock bit of ACR
iommu/arm-smmu: Support SMMUv1 64KB supplement
iommu/arm-smmu: Decouple context format from kernel config
iommu/arm-smmu: Tidy up 64-bit/atomic I/O accesses
io-64-nonatomic: Add relaxed accessor variants
iommu/arm-smmu: Work around MMU-500 prefetch errata
iommu/arm-smmu: Convert ThunderX workaround to new method
iommu/arm-smmu: Differentiate specific implementations
iommu/arm-smmu: Workaround for ThunderX erratum #27704
iommu/arm-smmu: Add support for 16 bit VMID
iommu/amd: Move get_device_id() and friends to beginning of file
iommu/amd: Don't use IS_ERR_VALUE to check integer values
iommu/amd: Signedness bug in acpihid_device_group()
...
Pull KVM updates from Paolo Bonzini:
"Small release overall.
x86:
- miscellaneous fixes
- AVIC support (local APIC virtualization, AMD version)
s390:
- polling for interrupts after a VCPU goes to halted state is now
enabled for s390
- use hardware provided information about facility bits that do not
need any hypervisor activity, and other fixes for cpu models and
facilities
- improve perf output
- floating interrupt controller improvements.
MIPS:
- miscellaneous fixes
PPC:
- bugfixes only
ARM:
- 16K page size support
- generic firmware probing layer for timer and GIC
Christoffer Dall (KVM-ARM maintainer) says:
"There are a few changes in this pull request touching things
outside KVM, but they should all carry the necessary acks and it
made the merge process much easier to do it this way."
though actually the irqchip maintainers' acks didn't make it into the
patches. Marc Zyngier, who is both irqchip and KVM-ARM maintainer,
later acked at http://mid.gmane.org/573351D1.4060303@arm.com ('more
formally and for documentation purposes')"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (82 commits)
KVM: MTRR: remove MSR 0x2f8
KVM: x86: make hwapic_isr_update and hwapic_irr_update look the same
svm: Manage vcpu load/unload when enable AVIC
svm: Do not intercept CR8 when enable AVIC
svm: Do not expose x2APIC when enable AVIC
KVM: x86: Introducing kvm_x86_ops.apicv_post_state_restore
svm: Add VMEXIT handlers for AVIC
svm: Add interrupt injection via AVIC
KVM: x86: Detect and Initialize AVIC support
svm: Introduce new AVIC VMCB registers
KVM: split kvm_vcpu_wake_up from kvm_vcpu_kick
KVM: x86: Introducing kvm_x86_ops VCPU blocking/unblocking hooks
KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks
KVM: x86: Rename kvm_apic_get_reg to kvm_lapic_get_reg
KVM: x86: Misc LAPIC changes to expose helper functions
KVM: shrink halt polling even more for invalid wakeups
KVM: s390: set halt polling to 80 microseconds
KVM: halt_polling: provide a way to qualify wakeups during poll
KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts
kvm: Conditionally register IRQ bypass consumer
...
Pull arm64 updates from Will Deacon:
- virt_to_page/page_address optimisations
- support for NUMA systems described using device-tree
- support for hibernate/suspend-to-disk
- proper support for maxcpus= command line parameter
- detection and graceful handling of AArch64-only CPUs
- miscellaneous cleanups and non-critical fixes
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (92 commits)
arm64: do not enforce strict 16 byte alignment to stack pointer
arm64: kernel: Fix incorrect brk randomization
arm64: cpuinfo: Missing NULL terminator in compat_hwcap_str
arm64: secondary_start_kernel: Remove unnecessary barrier
arm64: Ensure pmd_present() returns false after pmd_mknotpresent()
arm64: Replace hard-coded values in the pmd/pud_bad() macros
arm64: Implement pmdp_set_access_flags() for hardware AF/DBM
arm64: Fix typo in the pmdp_huge_get_and_clear() definition
arm64: mm: remove unnecessary EXPORT_SYMBOL_GPL
arm64: always use STRICT_MM_TYPECHECKS
arm64: kvm: Fix kvm teardown for systems using the extended idmap
arm64: kaslr: increase randomization granularity
arm64: kconfig: drop CONFIG_RTC_LIB dependency
arm64: make ARCH_SUPPORTS_DEBUG_PAGEALLOC depend on !HIBERNATION
arm64: hibernate: Refuse to hibernate if the boot cpu is offline
arm64: kernel: Add support for hibernate/suspend-to-disk
PM / Hibernate: Call flush_icache_range() on pages restored in-place
arm64: Add new asm macro copy_page
arm64: Promote KERNEL_START/KERNEL_END definitions to a header file
arm64: kernel: Include _AC definition in page.h
...
Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.
For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.
This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.
This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM/ARM Changes for Linux v4.7
Reworks our stage 2 page table handling to have page table manipulation
macros separate from those of the host systems as the underlying
hardware page tables can be configured to be noticably different in
layout from the stage 1 page tables used by the host.
Adds 16K page size support based on the above.
Adds a generic firmware probing layer for the timer and GIC so that KVM
initializes using the same logic based on both ACPI and FDT.
Finally adds support for hardware updating of the access flag.
The ARMv8.1 architecture extensions introduce support for hardware
updates of the access and dirty information in page table entries. With
VTCR_EL2.HA enabled (bit 21), when the CPU accesses an IPA with the
PTE_AF bit cleared in the stage 2 page table, instead of raising an
Access Flag fault to EL2 the CPU sets the actual page table entry bit
(10). To ensure that kernel modifications to the page table do not
inadvertently revert a bit set by hardware updates, certain Stage 2
software pte/pmd operations must be performed atomically.
The main user of the AF bit is the kvm_age_hva() mechanism. The
kvm_age_hva_handler() function performs a "test and clear young" action
on the pte/pmd. This needs to be atomic in respect of automatic hardware
updates of the AF bit. Since the AF bit is in the same position for both
Stage 1 and Stage 2, the patch reuses the existing
ptep_test_and_clear_young() functionality if
__HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG is defined. Otherwise, the
existing pte_young/pte_mkold mechanism is preserved.
The kvm_set_s2pte_readonly() (and the corresponding pmd equivalent) have
to perform atomic modifications in order to avoid a race with updates of
the AF bit. The arm64 implementation has been re-written using
exclusives.
Currently, kvm_set_s2pte_writable() (and pmd equivalent) take a pointer
argument and modify the pte/pmd in place. However, these functions are
only used on local variables rather than actual page table entries, so
it makes more sense to follow the pte_mkwrite() approach for stage 1
attributes. The change to kvm_s2pte_mkwrite() makes it clear that these
functions do not modify the actual page table entries.
The (pte|pmd)_mkyoung() uses on Stage 2 entries (setting the AF bit
explicitly) do not need to be modified since hardware updates of the
dirty status are not supported by KVM, so there is no possibility of
losing such information.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
As a set of driver-provided callbacks and static data, there is no
compelling reason for struct iommu_ops to be mutable in core code, so
enforce const-ness throughout.
Acked-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Currently, pmd_present() only checks for a non-zero value, returning
true even after pmd_mknotpresent() (which only clears the type bits).
This patch converts pmd_present() to using pte_present(), similar to the
other pmd_*() checks. As a side effect, it will return true for
PROT_NONE mappings, though they are not yet used by the kernel with
transparent huge pages.
For consistency, also change pmd_mknotpresent() to only clear the
PMD_SECT_VALID bit, even though the PMD_TABLE_BIT is already 0 for block
mappings (no functional change). The unused PMD_SECT_PROT_NONE
definition is removed as transparent huge pages use the pte page prot
values.
Fixes: 9c7e535fcc ("arm64: mm: Route pmd thp functions through pte equivalents")
Cc: <stable@vger.kernel.org> # 3.15+
Reviewed-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
This patch replaces the hard-coded value 2 with PMD_TABLE_BIT in the
pmd/pud_bad() macros. Note that using these macros on pmd_trans_huge()
entries is giving incorrect results
(pmd_none_or_trans_huge_or_clear_bad() correctly checks for
pmd_trans_huge before pmd_bad).
Additionally, white-space clean-up for pmd_mkclean().
Reviewed-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
The update to the accessed or dirty states for block mappings must be
done atomically on hardware with support for automatic AF/DBM. The
ptep_set_access_flags() function has been fixed as part of commit
66dbd6e61a ("arm64: Implement ptep_set_access_flags() for hardware
AF/DBM"). This patch brings pmdp_set_access_flags() in line with the pte
counterpart.
Fixes: 2f4b829c62 ("arm64: Add support for hardware updates of the access and dirty pte bits")
Cc: <stable@vger.kernel.org> # 4.4.x: 66dbd6e61a: arm64: Implement ptep_set_access_flags() for hardware AF/DBM
Cc: <stable@vger.kernel.org> # 4.3+
Reviewed-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
With hardware AF/DBM support, pmd modifications (transparent huge pages)
should be performed atomically using load/store exclusive. The initial
patches defined the get-and-clear function and __HAVE_ARCH_* macro
without the "huge" word, leaving the pmdp_huge_get_and_clear() to the
default, non-atomic implementation.
Fixes: 2f4b829c62 ("arm64: Add support for hardware updates of the access and dirty pte bits")
Cc: <stable@vger.kernel.org> # 4.3+
Reviewed-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Inspired by the counterpart of powerpc [1], which shows there is no negative
effect on code generation from enabling STRICT_MM_TYPECHECKS with a modern
compiler.
And, Arnd's comment [2] about that patch says STRICT_MM_TYPECHECKS could
be default as long as the architecture can pass structures in registers as
function arguments. ARM64 can do it as long as the size of structure <= 16
bytes. All the page table value types are u64 on ARM64.
The below disassembly demonstrates it, entry is pte_t type:
entry = arch_make_huge_pte(entry, vma, page, writable);
0xffff00000826fc38 <+80>: and x0, x0, #0xfffffffffffffffd
0xffff00000826fc3c <+84>: mov w3, w21
0xffff00000826fc40 <+88>: mov x2, x20
0xffff00000826fc44 <+92>: mov x1, x19
0xffff00000826fc48 <+96>: orr x0, x0, #0x400
0xffff00000826fc4c <+100>: bl 0xffff00000809bcc0 <arch_make_huge_pte>
[1] http://www.spinics.net/lists/linux-mm/msg105951.html
[2] http://www.spinics.net/lists/linux-mm/msg105969.html
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>