Pull cpu hotplug updates from Thomas Gleixner:
"This is the first part of the ongoing cpu hotplug rework:
- Initial implementation of the state machine
- Runs all online and prepare down callbacks on the plugged cpu and
not on some random processor
- Replaces busy loop waiting with completions
- Adds tracepoints so the states can be followed"
More detailed commentary on this work from an earlier email:
"What's wrong with the current cpu hotplug infrastructure?
- Asymmetry
The hotplug notifier mechanism is asymmetric versus the bringup and
teardown. This is mostly caused by the notifier mechanism.
- Largely undocumented dependencies
While some notifiers use explicitly defined notifier priorities,
quite a few notifiers use numerical priorities to express
dependencies without any documentation as to why.
- Control processor driven
Most of the bringup/teardown of a cpu is driven by a control
processor. While it is understandable that preparatory steps, like
idle thread creation and memory allocation for and initialization of
essential facilities, need to be done before a cpu can boot, there
is no reason why everything else must run on a control processor.
Before this patch series, bringup looked like this:
   Control CPU                     Booting CPU

   do preparatory steps
   kick cpu into life

                                   do low level init

   sync with booting cpu           sync with control cpu

   bring the rest up
- All or nothing approach
There is no way to do partial bringups. That's something which is
really desired because, e.g. at boot, we waste a substantial amount
of time just busy waiting for the cpu to come to life. That's stupid
as we could very well do preparatory steps and the initial IPI for
other cpus and then go back and do the necessary low level
synchronization with the freshly booted cpu.
- Minimal debuggability
Due to the notifier based design, it's impossible to switch between
two stages of the bringup/teardown back and forth in order to test
the correctness. So in many hotplug notifiers the cancel
mechanisms are either nonexistent or completely untested.
- Notifier [un]registering is tedious
To [un]register notifiers we need to protect against hotplug at
every callsite. There is no mechanism that invokes the
bringup/teardown callbacks on the already online cpus, so every
caller needs to do it itself. That also includes error rollback.
What's the new design?
The base of the new design is a symmetric state machine, where both
the control processor and the booting/dying cpu execute a well
defined set of states. Each state is symmetric in the end, except
for some well defined exceptions, and the bringup/teardown can be
stopped and reversed at almost all states.
So the bringup of a cpu will look like this in the future:
   Control CPU                     Booting CPU

   do preparatory steps
   kick cpu into life

                                   do low level init

   sync with booting cpu           sync with control cpu

                                   bring itself up
The synchronization step does not require the control cpu to wait.
That mechanism can be done asynchronously via a worker or some
other mechanism.
The teardown can be made very similar, so that the dying cpu cleans
up and brings itself down. Cleanups which need to be done after
the cpu is gone, can be scheduled asynchronously as well.
It is a long way to get there, as we need to refactor the notion of when a
cpu is available. Today we set the cpu online right after it comes
out of the low level bringup, which is not really correct.
The proper mechanism is to set it to available, i.e. cpu local
threads, like softirqd, hotplug thread etc. can be scheduled on that
cpu, and once it has finished all booting steps, it's set to online, so
general workloads can be scheduled on it. The reverse happens on
teardown. First thing to do is to forbid scheduling of general
workloads, then teardown all the per cpu resources and finally shut it
off completely.
This patch series implements the basic infrastructure for this at the
core level. This includes the following:
- Basic state machine implementation with well defined states, so
ordering and prioritization can be expressed.
- Interfaces to [un]register state callbacks
This invokes the bringup/teardown callback on all online cpus with
the proper protection in place and [un]installs the callbacks in
the state machine array.
For callbacks which have no particular ordering requirement we have
a dynamic state space, so that drivers don't have to register an
explicit hotplug state.
If a callback fails, the code automatically does a rollback to the
previous state (see the registration sketch below).
- Sysfs interface to drive the state machine to a particular step.
This is only partially functional today. Full functionality and
therefore testability will be achieved once we have converted all
existing hotplug notifiers over to the new scheme.
- Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
processor:
   Control CPU                     Booting CPU

   do preparatory steps
   kick cpu into life

                                   do low level init

   sync with booting cpu           sync with control cpu
   wait for boot
                                   bring itself up

                                   Signal completion to control cpu
In a previous step of this work we've done a full tree mechanical
conversion of all hotplug notifiers to the new scheme. The balance
is a net removal of about 4000 lines of code.
This is not included in this series, as we decided to take a
different approach. Instead of mechanically converting everything
over, we will do a proper overhaul of the usage sites one by one so
they nicely fit into the symmetric callback scheme.
I decided to do that after I looked at the ugliness of some of the
converted sites and figured out that their hotplug mechanism is
completely buggered anyway. So there is no point in doing a
mechanical conversion first, as we need to go through the usage
sites one by one again in order to achieve a full symmetric and
testable behaviour"
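To make the "Interfaces to [un]register state callbacks" item above concrete, here is a minimal registration sketch from a driver's point of view. It assumes the cpuhp_setup_state()/cpuhp_remove_state() helpers and the CPUHP_AP_ONLINE_DYN dynamic state range this interface became known under; treat the exact names and the return convention for dynamically allocated states as assumptions rather than the definitive API of this series.

  #include <linux/cpu.h>
  #include <linux/cpuhotplug.h>
  #include <linux/module.h>

  /* Illustrative per-cpu setup/teardown callbacks. Returning a negative
   * value from either causes the core to roll back to the previous state. */
  static int mydrv_cpu_online(unsigned int cpu)
  {
          pr_info("mydrv: cpu%u coming online\n", cpu);
          return 0;
  }

  static int mydrv_cpu_offline(unsigned int cpu)
  {
          pr_info("mydrv: cpu%u going down\n", cpu);
          return 0;
  }

  static int hp_state;

  static int __init mydrv_init(void)
  {
          /* Register in the dynamic online state space; the online callback
           * is invoked on all currently online cpus with hotplug protection
           * in place. For dynamic states the return value is assumed to be
           * the allocated state, which is later used for removal. */
          hp_state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mydrv:online",
                                       mydrv_cpu_online, mydrv_cpu_offline);
          return hp_state < 0 ? hp_state : 0;
  }

  static void __exit mydrv_exit(void)
  {
          /* Unregister; the teardown callback runs on all online cpus. */
          cpuhp_remove_state(hp_state);
  }

  module_init(mydrv_init);
  module_exit(mydrv_exit);
  MODULE_LICENSE("GPL");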
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
cpu/hotplug: Document states better
cpu/hotplug: Fix smpboot thread ordering
cpu/hotplug: Remove redundant state check
cpu/hotplug: Plug death reporting race
rcu: Make CPU_DYING_IDLE an explicit call
cpu/hotplug: Make wait for dead cpu completion based
cpu/hotplug: Let upcoming cpu bring itself fully up
arch/hotplug: Call into idle with a proper state
cpu/hotplug: Move online calls to hotplugged cpu
cpu/hotplug: Create hotplug threads
cpu/hotplug: Split out the state walk into functions
cpu/hotplug: Unpark smpboot threads from the state machine
cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
cpu/hotplug: Implement setup/removal interface
cpu/hotplug: Make target state writeable
cpu/hotplug: Add sysfs state interface
cpu/hotplug: Hand in target state to _cpu_up/down
cpu/hotplug: Convert the hotplugged cpu work to a state machine
cpu/hotplug: Convert to a state machine for the control processor
cpu/hotplug: Add tracepoints
...
Pull ram resource handling changes from Ingo Molnar:
"Core kernel resource handling changes to support NVDIMM error
injection.
This tree introduces a new I/O resource type, IORESOURCE_SYSTEM_RAM,
for System RAM while keeping the current IORESOURCE_MEM type bit set
for all memory-mapped ranges (including System RAM) for backward
compatibility.
With this resource flag it no longer takes a strcmp() loop through the
resource tree to find "System RAM" resources.
The new resource type is then used to extend ACPI/APEI error injection
facility to also support NVDIMM"
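As a rough illustration of the typed lookup this enables, here is a sketch; the exact signatures of region_intersects() and walk_iomem_res_desc() as introduced by this series are assumptions, and the helper names are illustrative.

  #include <linux/ioport.h>
  #include <linux/kernel.h>
  #include <linux/mm.h>

  /* Callback invoked by walk_iomem_res_desc() for each matching range. */
  static int count_ram_range(u64 start, u64 end, void *arg)
  {
          unsigned long *pages = arg;

          *pages += (end - start + 1) >> PAGE_SHIFT;
          return 0;
  }

  static void system_ram_example(resource_size_t addr, size_t size)
  {
          unsigned long pages = 0;

          /* Typed check instead of a strcmp("System RAM") walk of the tree. */
          if (region_intersects(addr, size, IORESOURCE_SYSTEM_RAM,
                                IORES_DESC_NONE) == REGION_INTERSECTS)
                  pr_info("range at %pa is System RAM\n", &addr);

          /* Walk all busy System RAM resources in [0, ~0ULL] and sum pages. */
          walk_iomem_res_desc(IORES_DESC_NONE,
                              IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
                              0, -1, &pages, count_ram_range);
          pr_info("%lu pages of System RAM found\n", pages);
  }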
* 'core-resources-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ACPI/EINJ: Allow memory error injection to NVDIMM
resource: Kill walk_iomem_res()
x86/kexec: Remove walk_iomem_res() call with GART type
x86, kexec, nvdimm: Use walk_iomem_res_desc() for iomem search
resource: Add walk_iomem_res_desc()
memremap: Change region_intersects() to take @flags and @desc
arm/samsung: Change s3c_pm_run_res() to use System RAM type
resource: Change walk_system_ram() to use System RAM type
drivers: Initialize resource entry to zero
xen, mm: Set IORESOURCE_SYSTEM_RAM to System RAM
kexec: Set IORESOURCE_SYSTEM_RAM for System RAM
arch: Set IORESOURCE_SYSTEM_RAM flag for System RAM
ia64: Set System RAM type and descriptor
x86/e820: Set System RAM type and descriptor
resource: Add I/O resource descriptor
resource: Handle resource flags properly
resource: Add System RAM resource type
Functions which the compiler has instrumented for KASAN place poison on
the stack shadow upon entry and remove this poison prior to returning.
In the case of cpuidle, CPUs exit the kernel a number of levels deep in
C code. Any instrumented functions on this critical path will leave
portions of the stack shadow poisoned.
If CPUs lose context and return to the kernel via a cold path, we
restore a prior context saved in __cpu_suspend_enter; the function
calls made between that point and the actual exit of the kernel are
forgotten, and the poison they placed in the stack shadow area is
never removed.
Thus, (depending on stackframe layout) subsequent calls to instrumented
functions may hit this stale poison, resulting in (spurious) KASAN
splats to the console.
To avoid this, clear any stale poison from the idle thread for a CPU
prior to bringing a CPU online.
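A minimal sketch of what that clearing amounts to, assuming the generic kasan_unpoison_shadow() helper and the task_stack_page()/THREAD_SIZE accessors; the exact hook in the hotplug path is an assumption.

  #include <linux/kasan.h>
  #include <linux/sched.h>

  /* Wipe any stale KASAN poison covering the idle task's stack shadow,
   * so a later cold return into the kernel does not trip over poison
   * left by instrumented calls on a previous cpuidle exit path. */
  static void clear_stale_idle_stack_poison(struct task_struct *idle)
  {
          kasan_unpoison_shadow(task_stack_page(idle), THREAD_SIZE);
  }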
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The prologue of the EFI entry point pushes x29 and x30 onto the stack but
fails to create the stack frame correctly by omitting the assignment of x29
to the new value of the stack pointer. So fix that.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Commit 0f54b14e76 ("arm64: cpufeature: Change read_cpuid() to use
sysreg's mrs_s macro") changed read_cpuid to require a SYS_ prefix on
register names, to allow manual assembly of registers unknown by the
toolchain, using tables in sysreg.h.
This interacts poorly with commit 42b5573403 ("efi/arm64: Check
for h/w support before booting a >4 KB granular kernel"), which is
currently queued via the tip tree, and uses read_cpuid without a SYS_
prefix. Due to this, a build of next-20160304 fails if EFI and 64K pages
are selected.
To avoid this issue when trees are merged, move the required SYS_
prefixing into read_cpuid, and revert all of the updated callsites to
pass plain register names. This effectively reverts the bulk of commit
0f54b14e76.
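The resulting read_cpuid() might look roughly like this (a sketch based on the description above; the exact macro body is an assumption, and __stringify/mrs_s come from the kernel's stringify.h/sysreg.h):

  /* Callers pass plain register names, e.g. read_cpuid(ID_AA64MMFR0_EL1);
   * the SYS_ prefix is pasted here so mrs_s can assemble registers the
   * toolchain does not know about. */
  #define read_cpuid(reg) ({                                              \
          u64 __val;                                                      \
          asm("mrs_s %0, " __stringify(SYS_ ## reg) : "=r" (__val));      \
          __val;                                                          \
  })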
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
We validate pstate using PSR_MODE32_BIT, which is part of the
user-provided pstate (and cannot be trusted). Also, we conflate
validation of AArch32 and AArch64 pstate values, making the code
difficult to reason about.
Instead, validate the pstate value based on the associated task. The
task may or may not be current (e.g. when using ptrace), so this must be
passed explicitly by callers. To avoid circular header dependencies via
sched.h, is_compat_task is pulled out of asm/ptrace.h.
To make the code possible to reason about, the AArch64 and AArch32
validation is split into separate functions. Software must respect the
RES0 policy for SPSR bits, and thus the kernel mirrors the hardware
policy (RAZ/WI) for bits as yet unallocated. When these acquire an
architected meaning, writes may be permitted (potentially with additional
validation).
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Commit 7175f0591e ("arm64: perf: Enable PMCR long cycle counter bit")
added initial support for a 64-bit cycle counter enabled using PMCR.LC.
Unfortunately, that patch doesn't extend ARMV8_EVTYPE_MASK, so any
attempts to set the enable bit are ignored by armv8pmu_pmcr_write.
This patch extends the mask to include the new bit.
Signed-off-by: Will Deacon <will.deacon@arm.com>
With ARMv8.1 VHE, the architecture is able to (almost) transparently
run the kernel at EL2, despite being written for EL1.
This patch takes care of the "almost" part, mostly preventing the kernel
from dropping from EL2 to EL1, and setting up the HYP configuration.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The fault decoding process (including computing the IPA in the case
of a permission fault) would be much better done in C code, as we
have a reasonable infrastructure to deal with the VHE/non-VHE
differences.
Let's move the whole thing to C, including the workaround for
erratum 834220, and just patch the odd ESR_EL2 access remaining
in hyp-entry.S.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Add a new ARM64_HAS_VIRT_HOST_EXTN feature to indicate that the
CPU has the ARMv8.1 VHE capability.
This will be used to trigger kernel patching in KVM.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When secondary cpus are booted through the ACPI parking protocol, the
booted cpu should check that FW has correctly cleared its mailbox entry
point value to make sure the boot process was correctly executed.
The entry point check is carried out in the cpu_ops->cpu_postboot method, which
is executed by secondary cpus when entering the kernel with irqs disabled.
The ACPI parking protocol cpu_ops maps/unmaps the mailboxes on the
primary CPU to trigger secondary boot in the cpu_ops->cpu_boot method
and on secondary processors to carry out FW checks on the booted CPU
to verify the boot protocol was successfully executed in the
cpu_ops->cpu_postboot method.
Therefore, the cpu_ops->cpu_postboot method is forced to ioremap/unmap the
mailboxes, which is wrong in that ioremap cannot safely be carried out
with irqs disabled.
To fix this issue, this patch reshuffles the code so that the mailboxes
are still mapped after the boot processor executes the cpu_ops->cpu_boot
method for a given cpu, and the VA at which a mailbox is mapped for a given
cpu is stashed in the per-cpu data struct so that secondary cpus can
retrieve them in the cpu_ops->cpu_postboot and complete the required
FW checks.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reported-by: Itaru Kitayama <itaru.kitayama@riken.jp>
Tested-by: Loc Ho <lho@apm.com>
Tested-by: Itaru Kitayama <itaru.kitayama@riken.jp>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Loc Ho <lho@apm.com>
Cc: Itaru Kitayama <itaru.kitayama@riken.jp>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Al Stone <ahs3@redhat.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
On ThunderX T88 pass 1.x through 2.1 parts, broadcast TLBI
instructions may cause the icache to become corrupted if it contains
data for a non-current ASID.
This patch implements the workaround (which invalidates the local
icache when switching the mm) by using code patching.
Signed-off-by: Andrew Pinski <apinski@cavium.com>
Signed-off-by: David Daney <david.daney@cavium.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Currently the .rodata section is still executable when DEBUG_RODATA
is enabled. This changes that so .rodata is actually read-only, no-execute.
It also adds the .rodata section to the mem_init banner.
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
[catalin.marinas@arm.com: added vm_struct vmlinux_rodata in map_kernel()]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Now that we have a clear understanding of the sign of a feature,
rename the routines to reflect the sign, so that it is not misused.
The cpuid_feature_extract_field() now accepts a 'sign' parameter.
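For illustration, sign-aware extraction of a 4-bit ID register field might look like this (a sketch; the helper name and width handling are illustrative, not the kernel's exact code):

  /* Extract the 4-bit field at 'shift', sign-extending when the field is
   * defined as signed (e.g. PMUVer, where 0xf means "not implemented"). */
  static inline s64 feature_extract_field(u64 reg, int shift, bool sign)
  {
          if (sign)
                  return (s64)(reg << (64 - shift - 4)) >> 60;
          return (reg >> shift) & 0xf;
  }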
Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
There is confusion over whether the values of a feature are signed
or not in ARM. This is not clearly mentioned in the ARM ARM either.
We have treated most of the fields as signed so far, and marked the
rest as unsigned explicitly. This is now fixed in the ARM ARM and
will be rolled out soon.
Here is the criteria in a nutshell:
1) The fields, which are either signed or unsigned, use increasing
numerical values to indicate an increase in functionality. Thus, if a value
of 0x1 indicates the presence of some instructions, then the 0x2 value will
indicate the presence of those instructions plus some additional instructions
or functionality.
2) For ID field values where the value 0x0 defines that a feature is not present,
the number is an unsigned value.
3) For some features where the feature was made optional or removed after the
start of the definition of the architecture, the value 0x0 is used to
indicate the presence of a feature, and 0xF indicates the absence of the
feature. In these cases, the fields are, in effect, holding signed values.
So with these rules applied, we have only the following fields which are signed and
the rest are unsigned.
a) ID_AA64PFR0_EL1: {FP, ASIMD}
b) ID_AA64MMFR0_EL1: {TGran4K, TGran64K}
c) ID_AA64DFR0_EL1: PMUVer (0xf - PMUv3 not implemented)
d) ID_DFR0_EL1: PerfMon
e) ID_MMFR0_EL1: {InnerShr, OuterShr}
Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Correct the feature bit entries for:
  ID_DFR0
  ID_MMFR0
to fix the default safe value for some of the bits.
Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Add a hook for checking whether a secondary CPU has the features
already used by the kernel during early boot, based on the boot CPU,
and plug in the check for ASID size.
The ID_AA64MMFR0_EL1:ASIDBits determines the size of the mm context
id and is used in the early boot to make decisions. The value is
picked up from the Boot CPU and cannot be delayed until other CPUs
are up. If a secondary CPU has a smaller size than that of the Boot
CPU, things will break horribly and the usual SANITY check is not good
enough to prevent the system from crashing. So, crash the system with
enough information.
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
We verify the capabilities of the secondary CPUs only when
hotplug is enabled. The boot time activated CPUs skip the
verification, by checking whether the system wide capabilities have
been initialised.
This patch removes the capability check dependency on CONFIG_HOTPLUG_CPU,
to make sure that all the secondary CPUs go through the check.
The boot time activated CPUs will still skip the system wide
capability check. The plan is to hook in a check for CPU features
used by the kernel at early boot up, based on the Boot CPU values.
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
A secondary CPU could fail to come online due to insufficient
capabilities and could simply die or loop in the kernel.
E.g., a CPU with no support for the selected kernel PAGE_SIZE
loops in the kernel with the MMU turned off, or a hotplugged CPU
which doesn't have one of the advertised system capabilities will
die during activation.
There is no way to synchronise the status of the failing CPU
back to the master. This patch solves the issue by adding a
field to the secondary_data which can be updated by the failing
CPU. If the secondary CPU fails even before turning the MMU on,
it updates the status in a special variable reserved in the head.text
section, so that the update can be cache-invalidated safely without
possibly sharing a cache writeback granule.
Here are the possible states (summarised as constants in the sketch after this list):
-1. CPU_MMU_OFF - Initial value set by the master CPU, this value
indicates that the CPU could not turn the MMU on, hence the status
could not be reliably updated in the secondary_data. Instead, the
CPU has updated the status @ __early_cpu_boot_status.
0. CPU_BOOT_SUCCESS - CPU has booted successfully.
1. CPU_KILL_ME - CPU has invoked cpu_ops->die, indicating the
master CPU to synchronise by issuing a cpu_ops->cpu_kill.
2. CPU_STUCK_IN_KERNEL - CPU couldn't invoke die(), instead is
looping in the kernel. This information could be used by say,
kexec to check if it is really safe to do a kexec reboot.
3. CPU_PANIC_KERNEL - CPU detected some serious issues which
requires kernel to crash immediately. The secondary CPU cannot
call panic() until it has initialised the GIC. This flag can
be used to instruct the master to do so.
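Collected as constants, the reporting protocol described above is roughly (a sketch; the exact definitions and the header they live in are assumptions):

  #define CPU_MMU_OFF             (-1)  /* initial value; real status is in __early_cpu_boot_status */
  #define CPU_BOOT_SUCCESS        (0)   /* CPU booted successfully */
  #define CPU_KILL_ME             (1)   /* cpu_ops->die invoked; master must issue cpu_ops->cpu_kill */
  #define CPU_STUCK_IN_KERNEL     (2)   /* could not die; looping in the kernel */
  #define CPU_PANIC_KERNEL        (3)   /* serious issue; master should panic */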
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
[catalin.marinas@arm.com: conflict resolution]
[catalin.marinas@arm.com: converted "status" from int to long]
[catalin.marinas@arm.com: updated update_early_cpu_boot_status to use str_l]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This patch moves cpu_die_early to smp.c, where it fits better.
No functional changes, except for adding the necessary checks
for CONFIG_HOTPLUG_CPU.
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Or in other words, make fail_incapable_cpu() reusable.
We use fail_incapable_cpu() to kill a secondary CPU early during
bringup when it doesn't have the system advertised capabilities.
This patch makes the routine more generic, so it can kill any
secondary booting CPU, getting rid of the dependency on the
capability struct. This can be used by checks which are not
necessarily attached to a capability struct (e.g., cpu ASIDBits).
In the process, the function is renamed to cpu_die_early() to better
match its functionality. It will be moved to arch/arm64/kernel/smp.c
later.
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
When KASLR is enabled (CONFIG_RANDOMIZE_BASE=y), and entropy has been
provided by the bootloader, randomize the placement of RAM inside the
linear region if sufficient space is available. For instance, on a 4KB
granule/3 levels kernel, the linear region is 256 GB in size, and we can
choose any 1 GB aligned offset that is far enough from the top of the
address space to fit the distance between the start of the lowest memblock
and the top of the highest memblock.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This adds support for KASLR, based on entropy provided by
the bootloader in the /chosen/kaslr-seed DT property. Depending on the size
of the address space (VA_BITS) and the page size, the entropy in the
virtual displacement is up to 13 bits (16k/2 levels) and up to 25 bits (all
4 levels), with the sidenote that displacements that result in the kernel
image straddling a 1GB/32MB/512MB alignment boundary (for 4KB/16KB/64KB
granule kernels, respectively) are not allowed, and will be rounded up to
an acceptable value.
If CONFIG_RANDOMIZE_MODULE_REGION_FULL is enabled, the module region is
randomized independently from the core kernel. This makes it less likely
that the location of core kernel data structures can be determined by an
adversary, but causes all function calls from modules into the core kernel
to be resolved via entries in the module PLTs.
If CONFIG_RANDOMIZE_MODULE_REGION_FULL is not enabled, the module region is
randomized by choosing a page aligned 128 MB region inside the interval
[_etext - 128 MB, _stext + 128 MB). This gives between 10 and 14 bits of
entropy (depending on page size), independently of the kernel randomization,
but still guarantees that modules are within the range of relative branch
and jump instructions (with the caveat that, since the module region is
shared with other uses of the vmalloc area, modules may need to be loaded
further away if the module region is exhausted)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This implements CONFIG_RELOCATABLE, which links the final vmlinux
image with a dynamic relocation section, allowing the early boot code
to perform a relocation to a different virtual address at runtime.
This is a prerequisite for KASLR (CONFIG_RANDOMIZE_BASE).
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Instead of using absolute addresses for both the exception location
and the fixup, use offsets relative to the exception table entry values.
Not only does this cut the size of the exception table in half, it is
also a prerequisite for KASLR, since absolute exception table entries
are subject to dynamic relocation, which is incompatible with the sorting
of the exception table that occurs at build time.
This patch also introduces the _ASM_EXTABLE preprocessor macro (which
exists on x86 as well) and its _asm_extable assembly counterpart, as
shorthands to emit exception table entries.
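Sketched out, the relative entries and the C-side shorthand look roughly like this (exact definitions are assumptions based on the description above):

  /* Each entry stores 32-bit offsets from the entry itself to the faulting
   * instruction and to its fixup, halving the table size and avoiding
   * absolute relocations. */
  struct exception_table_entry {
          int insn, fixup;
  };

  /* Shorthand for inline asm; an equivalent _asm_extable macro serves .S files. */
  #define _ASM_EXTABLE(from, to)                                          \
          "       .pushsection    __ex_table, \"a\"\n"                    \
          "       .align          3\n"                                    \
          "       .long           (" #from " - .), (" #to " - .)\n"       \
          "       .popsection\n"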
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Before implementing KASLR for arm64 by building a self-relocating PIE
executable, we have to ensure that values we use before the relocation
routine is executed are not subject to dynamic relocation themselves.
This applies not only to virtual addresses, but also to values that are
supplied by the linker at build time and relocated using R_AARCH64_ABS64
relocations.
So instead, use assembly-time constants, or force the use of static
relocations by folding the constants into the instructions.
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Unfortunately, the current way of using the linker to emit build time
constants into the Image header will no longer work once we switch to
the use of PIE executables. The reason is that such constants are emitted
into the binary using R_AARCH64_ABS64 relocations, which are resolved at
runtime, not at build time, and the places targeted by those relocations
will contain zeroes before that.
So refactor the endian swapping linker script constant generation code so
that it emits the upper and lower 32-bit words separately.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This adds support for emitting PLTs at module load time for relative
branches that are out of range. This is a prerequisite for KASLR, which
may place the kernel and the modules anywhere in the vmalloc area,
making it more likely that branch target offsets exceed the maximum
range of +/- 128 MB.
In this version, I removed the distinction between relocations against
.init executable sections and ordinary executable sections. The reason
is that it is hardly worth the trouble, given that .init.text usually
does not contain that many far branches, and this version now only
reserves PLT entry space for jump and call relocations against undefined
symbols (since symbols defined in the same module can be assumed to be
within +/- 128 MB)
For example, the mac80211.ko module (which is fairly sizable at ~400 KB)
built with -mcmodel=large gives the following relocation counts:
                   relocs    branches    unique    !local
  .text              3925        3347       518       219
  .init.text            11           8         7         1
  .exit.text              4           4         4         1
  .text.unlikely         81          67        36        17
('unique' means branches to unique type/symbol/addend combos, of which
!local is the subset referring to undefined symbols)
IOW, we are only emitting a single PLT entry for the .init sections, and
we are better off just adding it to the core PLT section instead.
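For reference, a module PLT veneer for AArch64 is essentially a four-instruction stub (a sketch; the field names are illustrative, and x16 is the AAPCS64 intra-procedure-call scratch register commonly used for veneers):

  struct plt_entry {
          __le32  mov0;   /* movn x16, #0x....            */
          __le32  mov1;   /* movk x16, #0x...., lsl #16   */
          __le32  mov2;   /* movk x16, #0x...., lsl #32   */
          __le32  br;     /* br   x16                     */
  };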
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This relaxes the kernel Image placement requirements, so that it
may be placed at any 2 MB aligned offset in physical memory.
This is accomplished by ignoring PHYS_OFFSET when installing
memblocks, and accounting for the apparent virtual offset of
the kernel Image. As a result, virtual address references
below PAGE_OFFSET are correctly mapped onto physical references
into the kernel Image regardless of where it sits in memory.
Special care needs to be taken for dealing with memory limits passed
via mem=, since the generic implementation clips memory top down, which
may clip the kernel image itself if it is loaded high up in memory. To
deal with this case, we simply add back the memory covering the kernel
image, which may result in more memory to be retained than was passed
as a mem= parameter.
Since mem= should not be considered a production feature, a panic notifier
handler is installed that dumps the memory limit at panic time if one was
set.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This introduces the preprocessor symbol KIMAGE_VADDR which will serve as
the symbolic virtual base of the kernel region, i.e., the kernel's virtual
offset will be KIMAGE_VADDR + TEXT_OFFSET. For now, we define it as being
equal to PAGE_OFFSET, but in the future, it will be moved below it once
we move the kernel virtual mapping out of the linear mapping.
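In other words, a sketch of the interim definition described above:

  /* Kernel image virtual base; for now it coincides with the start of the
   * linear mapping, so the image's virtual offset is KIMAGE_VADDR + TEXT_OFFSET. */
  #define KIMAGE_VADDR    (PAGE_OFFSET)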
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This function was introduced by previous commits implementing UAO.
However, it can be replaced with task_thread_info() in
uao_thread_switch() or get_fs() in do_page_fault() (the latter being
called only on the current context, so no need for using the saved
pt_regs).
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
If a CPU supports both Privileged Access Never (PAN) and User Access
Override (UAO), we don't need to disable/re-enable PAN around all
copy_to_user()-like calls.
UAO alternatives cause these calls to use the 'unprivileged' load/store
instructions, which are overridden to be the privileged kind when
fs==KERNEL_DS.
This patch changes the copy_to_user() calls to have their PAN toggling
depend on a new composite 'feature' ARM64_ALT_PAN_NOT_UAO.
If both features are detected, PAN will be enabled, but the copy_to_user()
alternatives will not be applied. This means PAN will be enabled all the
time for these functions. If only PAN is detected, the toggling will be
enabled as normal.
This will save the time taken to disable/re-enable PAN, and allow us to
catch copy_to_user() accesses that occur with fs==KERNEL_DS.
Futex and swp-emulation code continue to hang their PAN toggling code on
ARM64_HAS_PAN.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
CPU feature code uses the desc field as a test to find the end of the list;
this means every entry must have a description. This generates noise for
entries in the list that aren't really features, but combinations of them.
e.g.
> CPU features: detected feature: Privileged Access Never
> CPU features: detected feature: PAN and not UAO
These combination features are needed for corner cases with alternatives,
where cpu features interact.
Change all walkers of the arm64_features[] and arm64_hwcaps[] lists to test
'matches' not 'desc', and only print 'desc' if it is non-NULL.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
'User Access Override' is a new ARMv8.2 feature which allows the
unprivileged load and store instructions to be overridden to behave in
the normal way.
This patch converts {get,put}_user() and friends to use ldtr*/sttr*
instructions - so that they can only access EL0 memory, then enables
UAO when fs==KERNEL_DS so that these functions can access kernel memory.
This allows user space's read/write permissions to be checked against the
page tables, instead of testing addr<USER_DS, then using the kernel's
read/write permissions.
Signed-off-by: James Morse <james.morse@arm.com>
[catalin.marinas@arm.com: move uao_thread_switch() above dsb()]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
ARMv8.1 increases the PMU event number space to 16 bits, so increase
the EVTYPE mask.
Signed-off-by: Jan Glauber <jglauber@cavium.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
With the long cycle counter bit (LC) disabled, the cycle counter does not
work on the ThunderX SOC (ThunderX only implements AArch64).
Also, according to the documentation, LC == 0 is deprecated.
To keep the code simple the patch does not introduce 64 bit wide counter
functions. Instead writing the cycle counter always sets the upper
32 bits so overflow interrupts are generated as before.
Original patch from Andrew Pinksi <Andrew.Pinksi@caviumnetworks.com>
Signed-off-by: Jan Glauber <jglauber@cavium.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
The implemented Cortex A57 events are strictly speaking not
A57 specific. They are ARM recommended implementation defined events
and can be found on other ARMv8 SOCs like Cavium ThunderX too.
Therefore rename these events to allow using them in other
implementations too.
Signed-off-by: Jan Glauber <jglauber@cavium.com>
[will: capitalisation and ordering]
Signed-off-by: Will Deacon <will.deacon@arm.com>
ARMv8.2 adds a new feature register id_aa64mmfr2. This patch adds the
cpu feature boilerplate used by the actual features in later patches.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Older assemblers may not have support for newer feature registers. To get
round this, sysreg.h provides a 'mrs_s' macro that takes a register
encoding and generates the raw instruction.
Change read_cpuid() to use mrs_s in all cases so that new registers
don't have to be a special case. Including sysreg.h means we need to move
the include and definition of read_cpuid() after the #ifndef __ASSEMBLY__
to avoid syntax errors in vmlinux.lds.
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Although the arm64 vDSO is cleanly separated by code/data with the
code being read-only in userspace mappings, the code page is still
writable from the kernel. There have been exploits (such as
http://itszn.com/blog/?p=21) that take advantage of this on x86 to go
from a bad kernel write to full root.
Prevent this specific exploit on arm64 by putting the vDSO code page
in read-only memory as well.
Before the change:
[ 3.138366] vdso: 2 pages (1 code @ ffffffc000a71000, 1 data @ ffffffc000a70000)
---[ Kernel Mapping ]---
0xffffffc000000000-0xffffffc000082000 520K RW NX SHD AF UXN MEM/NORMAL
0xffffffc000082000-0xffffffc000200000 1528K ro x SHD AF UXN MEM/NORMAL
0xffffffc000200000-0xffffffc000800000 6M ro x SHD AF BLK UXN MEM/NORMAL
0xffffffc000800000-0xffffffc0009b6000 1752K ro x SHD AF UXN MEM/NORMAL
0xffffffc0009b6000-0xffffffc000c00000 2344K RW NX SHD AF UXN MEM/NORMAL
0xffffffc000c00000-0xffffffc008000000 116M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc00c000000-0xffffffc07f000000 1840M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc800000000-0xffffffc840000000 1G RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc840000000-0xffffffc87ae00000 942M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc87ae00000-0xffffffc87ae70000 448K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87af80000-0xffffffc87af8a000 40K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87af8b000-0xffffffc87b000000 468K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87b000000-0xffffffc87fe00000 78M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc87fe00000-0xffffffc87ff50000 1344K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87ff90000-0xffffffc87ffa0000 64K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87fff0000-0xffffffc880000000 64K RW NX SHD AF UXN MEM/NORMAL
After:
[ 3.138368] vdso: 2 pages (1 code @ ffffffc0006de000, 1 data @ ffffffc000a74000)
---[ Kernel Mapping ]---
0xffffffc000000000-0xffffffc000082000 520K RW NX SHD AF UXN MEM/NORMAL
0xffffffc000082000-0xffffffc000200000 1528K ro x SHD AF UXN MEM/NORMAL
0xffffffc000200000-0xffffffc000800000 6M ro x SHD AF BLK UXN MEM/NORMAL
0xffffffc000800000-0xffffffc0009b8000 1760K ro x SHD AF UXN MEM/NORMAL
0xffffffc0009b8000-0xffffffc000c00000 2336K RW NX SHD AF UXN MEM/NORMAL
0xffffffc000c00000-0xffffffc008000000 116M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc00c000000-0xffffffc07f000000 1840M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc800000000-0xffffffc840000000 1G RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc840000000-0xffffffc87ae00000 942M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc87ae00000-0xffffffc87ae70000 448K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87af80000-0xffffffc87af8a000 40K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87af8b000-0xffffffc87b000000 468K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87b000000-0xffffffc87fe00000 78M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc87fe00000-0xffffffc87ff50000 1344K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87ff90000-0xffffffc87ffa0000 64K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87fff0000-0xffffffc880000000 64K RW NX SHD AF UXN MEM/NORMAL
Inspired by https://lkml.org/lkml/2016/1/19/494 based on work by the
PaX Team, Brad Spengler, and Kees Cook.
Signed-off-by: David Brown <david.brown@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[catalin.marinas@arm.com: removed superfluous __PAGE_ALIGNED_DATA]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>