Merge tag 'kvmarm-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm updates for 5.4 - New ITS translation cache - Allow up to 512 CPUs to be supported with GICv3 (for real this time) - Now call kvm_arch_vcpu_blocking early in the blocking sequence - Tidy-up device mappings in S2 when DIC is available - Clean icache invalidation on VMID rollover - General cleanup
This commit is contained in:
@@ -41,10 +41,11 @@ Related CVEs
|
||||
|
||||
The following CVE entries describe Spectre variants:
|
||||
|
||||
============= ======================= =================
|
||||
============= ======================= ==========================
|
||||
CVE-2017-5753 Bounds check bypass Spectre variant 1
|
||||
CVE-2017-5715 Branch target injection Spectre variant 2
|
||||
============= ======================= =================
|
||||
CVE-2019-1125 Spectre v1 swapgs Spectre variant 1 (swapgs)
|
||||
============= ======================= ==========================
|
||||
|
||||
Problem
|
||||
-------
|
||||
@@ -78,6 +79,13 @@ There are some extensions of Spectre variant 1 attacks for reading data
|
||||
over the network, see :ref:`[12] <spec_ref12>`. However such attacks
|
||||
are difficult, low bandwidth, fragile, and are considered low risk.
|
||||
|
||||
Note that, despite "Bounds Check Bypass" name, Spectre variant 1 is not
|
||||
only about user-controlled array bounds checks. It can affect any
|
||||
conditional checks. The kernel entry code interrupt, exception, and NMI
|
||||
handlers all have conditional swapgs checks. Those may be problematic
|
||||
in the context of Spectre v1, as kernel code can speculatively run with
|
||||
a user GS.
|
||||
|
||||
Spectre variant 2 (Branch Target Injection)
|
||||
-------------------------------------------
|
||||
|
||||
@@ -132,6 +140,9 @@ not cover all possible attack vectors.
|
||||
1. A user process attacking the kernel
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Spectre variant 1
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
The attacker passes a parameter to the kernel via a register or
|
||||
via a known address in memory during a syscall. Such parameter may
|
||||
be used later by the kernel as an index to an array or to derive
|
||||
@@ -144,7 +155,40 @@ not cover all possible attack vectors.
|
||||
potentially be influenced for Spectre attacks, new "nospec" accessor
|
||||
macros are used to prevent speculative loading of data.
|
||||
|
||||
Spectre variant 2 attacker can :ref:`poison <poison_btb>` the branch
|
||||
Spectre variant 1 (swapgs)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
An attacker can train the branch predictor to speculatively skip the
|
||||
swapgs path for an interrupt or exception. If they initialize
|
||||
the GS register to a user-space value, if the swapgs is speculatively
|
||||
skipped, subsequent GS-related percpu accesses in the speculation
|
||||
window will be done with the attacker-controlled GS value. This
|
||||
could cause privileged memory to be accessed and leaked.
|
||||
|
||||
For example:
|
||||
|
||||
::
|
||||
|
||||
if (coming from user space)
|
||||
swapgs
|
||||
mov %gs:<percpu_offset>, %reg
|
||||
mov (%reg), %reg1
|
||||
|
||||
When coming from user space, the CPU can speculatively skip the
|
||||
swapgs, and then do a speculative percpu load using the user GS
|
||||
value. So the user can speculatively force a read of any kernel
|
||||
value. If a gadget exists which uses the percpu value as an address
|
||||
in another load/store, then the contents of the kernel value may
|
||||
become visible via an L1 side channel attack.
|
||||
|
||||
A similar attack exists when coming from kernel space. The CPU can
|
||||
speculatively do the swapgs, causing the user GS to get used for the
|
||||
rest of the speculative window.
|
||||
|
||||
Spectre variant 2
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
A spectre variant 2 attacker can :ref:`poison <poison_btb>` the branch
|
||||
target buffer (BTB) before issuing syscall to launch an attack.
|
||||
After entering the kernel, the kernel could use the poisoned branch
|
||||
target buffer on indirect jump and jump to gadget code in speculative
|
||||
@@ -280,11 +324,18 @@ The sysfs file showing Spectre variant 1 mitigation status is:
|
||||
|
||||
The possible values in this file are:
|
||||
|
||||
======================================= =================================
|
||||
'Mitigation: __user pointer sanitation' Protection in kernel on a case by
|
||||
case base with explicit pointer
|
||||
sanitation.
|
||||
======================================= =================================
|
||||
.. list-table::
|
||||
|
||||
* - 'Not affected'
|
||||
- The processor is not vulnerable.
|
||||
* - 'Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers'
|
||||
- The swapgs protections are disabled; otherwise it has
|
||||
protection in the kernel on a case by case base with explicit
|
||||
pointer sanitation and usercopy LFENCE barriers.
|
||||
* - 'Mitigation: usercopy/swapgs barriers and __user pointer sanitization'
|
||||
- Protection in the kernel on a case by case base with explicit
|
||||
pointer sanitation, usercopy LFENCE barriers, and swapgs LFENCE
|
||||
barriers.
|
||||
|
||||
However, the protections are put in place on a case by case basis,
|
||||
and there is no guarantee that all possible attack vectors for Spectre
|
||||
@@ -366,12 +417,27 @@ Turning on mitigation for Spectre variant 1 and Spectre variant 2
|
||||
1. Kernel mitigation
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Spectre variant 1
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
For the Spectre variant 1, vulnerable kernel code (as determined
|
||||
by code audit or scanning tools) is annotated on a case by case
|
||||
basis to use nospec accessor macros for bounds clipping :ref:`[2]
|
||||
<spec_ref2>` to avoid any usable disclosure gadgets. However, it may
|
||||
not cover all attack vectors for Spectre variant 1.
|
||||
|
||||
Copy-from-user code has an LFENCE barrier to prevent the access_ok()
|
||||
check from being mis-speculated. The barrier is done by the
|
||||
barrier_nospec() macro.
|
||||
|
||||
For the swapgs variant of Spectre variant 1, LFENCE barriers are
|
||||
added to interrupt, exception and NMI entry where needed. These
|
||||
barriers are done by the FENCE_SWAPGS_KERNEL_ENTRY and
|
||||
FENCE_SWAPGS_USER_ENTRY macros.
|
||||
|
||||
Spectre variant 2
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
For Spectre variant 2 mitigation, the compiler turns indirect calls or
|
||||
jumps in the kernel into equivalent return trampolines (retpolines)
|
||||
:ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` to go to the target
|
||||
@@ -473,6 +539,12 @@ Mitigation control on the kernel command line
|
||||
Spectre variant 2 mitigation can be disabled or force enabled at the
|
||||
kernel command line.
|
||||
|
||||
nospectre_v1
|
||||
|
||||
[X86,PPC] Disable mitigations for Spectre Variant 1
|
||||
(bounds check bypass). With this option data leaks are
|
||||
possible in the system.
|
||||
|
||||
nospectre_v2
|
||||
|
||||
[X86] Disable all mitigations for the Spectre variant 2
|
||||
|
@@ -2604,7 +2604,7 @@
|
||||
expose users to several CPU vulnerabilities.
|
||||
Equivalent to: nopti [X86,PPC]
|
||||
kpti=0 [ARM64]
|
||||
nospectre_v1 [PPC]
|
||||
nospectre_v1 [X86,PPC]
|
||||
nobp=0 [S390]
|
||||
nospectre_v2 [X86,PPC,S390,ARM64]
|
||||
spectre_v2_user=off [X86]
|
||||
@@ -2965,9 +2965,9 @@
|
||||
nosmt=force: Force disable SMT, cannot be undone
|
||||
via the sysfs control file.
|
||||
|
||||
nospectre_v1 [PPC] Disable mitigations for Spectre Variant 1 (bounds
|
||||
check bypass). With this option data leaks are possible
|
||||
in the system.
|
||||
nospectre_v1 [X86,PPC] Disable mitigations for Spectre Variant 1
|
||||
(bounds check bypass). With this option data leaks are
|
||||
possible in the system.
|
||||
|
||||
nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for
|
||||
the Spectre variant 2 (indirect branch prediction)
|
||||
|
@@ -1,162 +0,0 @@
|
||||
===================
|
||||
RISC-V CPU Bindings
|
||||
===================
|
||||
|
||||
The device tree allows to describe the layout of CPUs in a system through
|
||||
the "cpus" node, which in turn contains a number of subnodes (ie "cpu")
|
||||
defining properties for every cpu.
|
||||
|
||||
Bindings for CPU nodes follow the Devicetree Specification, available from:
|
||||
|
||||
https://www.devicetree.org/specifications/
|
||||
|
||||
with updates for 32-bit and 64-bit RISC-V systems provided in this document.
|
||||
|
||||
===========
|
||||
Terminology
|
||||
===========
|
||||
|
||||
This document uses some terminology common to the RISC-V community that is not
|
||||
widely used, the definitions of which are listed here:
|
||||
|
||||
* hart: A hardware execution context, which contains all the state mandated by
|
||||
the RISC-V ISA: a PC and some registers. This terminology is designed to
|
||||
disambiguate software's view of execution contexts from any particular
|
||||
microarchitectural implementation strategy. For example, my Intel laptop is
|
||||
described as having one socket with two cores, each of which has two hyper
|
||||
threads. Therefore this system has four harts.
|
||||
|
||||
=====================================
|
||||
cpus and cpu node bindings definition
|
||||
=====================================
|
||||
|
||||
The RISC-V architecture, in accordance with the Devicetree Specification,
|
||||
requires the cpus and cpu nodes to be present and contain the properties
|
||||
described below.
|
||||
|
||||
- cpus node
|
||||
|
||||
Description: Container of cpu nodes
|
||||
|
||||
The node name must be "cpus".
|
||||
|
||||
A cpus node must define the following properties:
|
||||
|
||||
- #address-cells
|
||||
Usage: required
|
||||
Value type: <u32>
|
||||
Definition: must be set to 1
|
||||
- #size-cells
|
||||
Usage: required
|
||||
Value type: <u32>
|
||||
Definition: must be set to 0
|
||||
|
||||
- cpu node
|
||||
|
||||
Description: Describes a hart context
|
||||
|
||||
PROPERTIES
|
||||
|
||||
- device_type
|
||||
Usage: required
|
||||
Value type: <string>
|
||||
Definition: must be "cpu"
|
||||
- reg
|
||||
Usage: required
|
||||
Value type: <u32>
|
||||
Definition: The hart ID of this CPU node
|
||||
- compatible:
|
||||
Usage: required
|
||||
Value type: <stringlist>
|
||||
Definition: must contain "riscv", may contain one of
|
||||
"sifive,rocket0"
|
||||
- mmu-type:
|
||||
Usage: optional
|
||||
Value type: <string>
|
||||
Definition: Specifies the CPU's MMU type. Possible values are
|
||||
"riscv,sv32"
|
||||
"riscv,sv39"
|
||||
"riscv,sv48"
|
||||
- riscv,isa:
|
||||
Usage: required
|
||||
Value type: <string>
|
||||
Definition: Contains the RISC-V ISA string of this hart. These
|
||||
ISA strings are defined by the RISC-V ISA manual.
|
||||
|
||||
Example: SiFive Freedom U540G Development Kit
|
||||
---------------------------------------------
|
||||
|
||||
This system contains two harts: a hart marked as disabled that's used for
|
||||
low-level system tasks and should be ignored by Linux, and a second hart that
|
||||
Linux is allowed to run on.
|
||||
|
||||
cpus {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <0>;
|
||||
timebase-frequency = <1000000>;
|
||||
cpu@0 {
|
||||
clock-frequency = <1600000000>;
|
||||
compatible = "sifive,rocket0", "riscv";
|
||||
device_type = "cpu";
|
||||
i-cache-block-size = <64>;
|
||||
i-cache-sets = <128>;
|
||||
i-cache-size = <16384>;
|
||||
next-level-cache = <&L15 &L0>;
|
||||
reg = <0>;
|
||||
riscv,isa = "rv64imac";
|
||||
status = "disabled";
|
||||
L10: interrupt-controller {
|
||||
#interrupt-cells = <1>;
|
||||
compatible = "riscv,cpu-intc";
|
||||
interrupt-controller;
|
||||
};
|
||||
};
|
||||
cpu@1 {
|
||||
clock-frequency = <1600000000>;
|
||||
compatible = "sifive,rocket0", "riscv";
|
||||
d-cache-block-size = <64>;
|
||||
d-cache-sets = <64>;
|
||||
d-cache-size = <32768>;
|
||||
d-tlb-sets = <1>;
|
||||
d-tlb-size = <32>;
|
||||
device_type = "cpu";
|
||||
i-cache-block-size = <64>;
|
||||
i-cache-sets = <64>;
|
||||
i-cache-size = <32768>;
|
||||
i-tlb-sets = <1>;
|
||||
i-tlb-size = <32>;
|
||||
mmu-type = "riscv,sv39";
|
||||
next-level-cache = <&L15 &L0>;
|
||||
reg = <1>;
|
||||
riscv,isa = "rv64imafdc";
|
||||
status = "okay";
|
||||
tlb-split;
|
||||
L13: interrupt-controller {
|
||||
#interrupt-cells = <1>;
|
||||
compatible = "riscv,cpu-intc";
|
||||
interrupt-controller;
|
||||
};
|
||||
};
|
||||
};
|
||||
|
||||
Example: Spike ISA Simulator with 1 Hart
|
||||
----------------------------------------
|
||||
|
||||
This device tree matches the Spike ISA golden model as run with `spike -p1`.
|
||||
|
||||
cpus {
|
||||
cpu@0 {
|
||||
device_type = "cpu";
|
||||
reg = <0x00000000>;
|
||||
status = "okay";
|
||||
compatible = "riscv";
|
||||
riscv,isa = "rv64imafdc";
|
||||
mmu-type = "riscv,sv48";
|
||||
clock-frequency = <0x3b9aca00>;
|
||||
interrupt-controller {
|
||||
#interrupt-cells = <0x00000001>;
|
||||
interrupt-controller;
|
||||
compatible = "riscv,cpu-intc";
|
||||
}
|
||||
}
|
||||
}
|
@@ -10,6 +10,18 @@ maintainers:
|
||||
- Paul Walmsley <paul.walmsley@sifive.com>
|
||||
- Palmer Dabbelt <palmer@sifive.com>
|
||||
|
||||
description: |
|
||||
This document uses some terminology common to the RISC-V community
|
||||
that is not widely used, the definitions of which are listed here:
|
||||
|
||||
hart: A hardware execution context, which contains all the state
|
||||
mandated by the RISC-V ISA: a PC and some registers. This
|
||||
terminology is designed to disambiguate software's view of execution
|
||||
contexts from any particular microarchitectural implementation
|
||||
strategy. For example, an Intel laptop containing one socket with
|
||||
two cores, each of which has two hyperthreads, could be described as
|
||||
having four harts.
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
items:
|
||||
@@ -50,6 +62,10 @@ properties:
|
||||
User-Level ISA document, available from
|
||||
https://riscv.org/specifications/
|
||||
|
||||
While the isa strings in ISA specification are case
|
||||
insensitive, letters in the riscv,isa string must be all
|
||||
lowercase to simplify parsing.
|
||||
|
||||
timebase-frequency:
|
||||
type: integer
|
||||
minimum: 1
|
||||
|
@@ -19,7 +19,7 @@ properties:
|
||||
compatible:
|
||||
items:
|
||||
- enum:
|
||||
- sifive,freedom-unleashed-a00
|
||||
- sifive,hifive-unleashed-a00
|
||||
- const: sifive,fu540-c000
|
||||
- const: sifive,fu540
|
||||
...
|
||||
|
@@ -73,7 +73,6 @@ patternProperties:
|
||||
Compatible of the SPI device.
|
||||
|
||||
reg:
|
||||
maxItems: 1
|
||||
minimum: 0
|
||||
maximum: 256
|
||||
description:
|
||||
|
@@ -13,7 +13,8 @@ a) SMB3 (and SMB3.1.1) missing optional features:
|
||||
- T10 copy offload ie "ODX" (copy chunk, and "Duplicate Extents" ioctl
|
||||
currently the only two server side copy mechanisms supported)
|
||||
|
||||
b) improved sparse file support
|
||||
b) improved sparse file support (fiemap and SEEK_HOLE are implemented
|
||||
but additional features would be supportable by the protocol).
|
||||
|
||||
c) Directory entry caching relies on a 1 second timer, rather than
|
||||
using Directory Leases, currently only the root file handle is cached longer
|
||||
@@ -21,9 +22,13 @@ using Directory Leases, currently only the root file handle is cached longer
|
||||
d) quota support (needs minor kernel change since quota calls
|
||||
to make it to network filesystems or deviceless filesystems)
|
||||
|
||||
e) Additional use cases where we use "compoounding" (e.g. open/query/close
|
||||
and open/setinfo/close) to reduce the number of roundtrips, and also
|
||||
open to reduce redundant opens (using deferred close and reference counts more).
|
||||
e) Additional use cases can be optimized to use "compounding"
|
||||
(e.g. open/query/close and open/setinfo/close) to reduce the number
|
||||
of roundtrips to the server and improve performance. Various cases
|
||||
(stat, statfs, create, unlink, mkdir) already have been improved by
|
||||
using compounding but more can be done. In addition we could significantly
|
||||
reduce redundant opens by using deferred close (with handle caching leases)
|
||||
and better using reference counters on file handles.
|
||||
|
||||
f) Finish inotify support so kde and gnome file list windows
|
||||
will autorefresh (partially complete by Asser). Needs minor kernel
|
||||
@@ -43,18 +48,17 @@ mount or a per server basis to client UIDs or nobody if no mapping
|
||||
exists. Also better integration with winbind for resolving SID owners
|
||||
|
||||
k) Add tools to take advantage of more smb3 specific ioctls and features
|
||||
(passthrough ioctl/fsctl for sending various SMB3 fsctls to the server
|
||||
is in progress, and a passthrough query_info call is already implemented
|
||||
in cifs.ko to allow smb3 info levels queries to be sent from userspace)
|
||||
(passthrough ioctl/fsctl is now implemented in cifs.ko to allow sending
|
||||
various SMB3 fsctls and query info and set info calls directly from user space)
|
||||
Add tools to make setting various non-POSIX metadata attributes easier
|
||||
from tools (e.g. extending what was done in smb-info tool).
|
||||
|
||||
l) encrypted file support
|
||||
|
||||
m) improved stats gathering tools (perhaps integration with nfsometer?)
|
||||
to extend and make easier to use what is currently in /proc/fs/cifs/Stats
|
||||
|
||||
n) allow setting more NTFS/SMB3 file attributes remotely (currently limited to compressed
|
||||
file attribute via chflags) and improve user space tools for managing and
|
||||
viewing them.
|
||||
n) Add support for claims based ACLs ("DAC")
|
||||
|
||||
o) mount helper GUI (to simplify the various configuration options on mount)
|
||||
|
||||
@@ -82,6 +86,8 @@ so far).
|
||||
w) Add support for additional strong encryption types, and additional spnego
|
||||
authentication mechanisms (see MS-SMB2)
|
||||
|
||||
x) Finish support for SMB3.1.1 compression
|
||||
|
||||
KNOWN BUGS
|
||||
====================================
|
||||
See http://bugzilla.samba.org - search on product "CifsVFS" for
|
||||
|
@@ -424,13 +424,24 @@ Statistics
|
||||
Following minimum set of TLS-related statistics should be reported
|
||||
by the driver:
|
||||
|
||||
* ``rx_tls_decrypted`` - number of successfully decrypted TLS segments
|
||||
* ``tx_tls_encrypted`` - number of in-order TLS segments passed to device
|
||||
for encryption
|
||||
* ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
|
||||
which were part of a TLS stream.
|
||||
* ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
|
||||
which were successfully decrypted.
|
||||
* ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
|
||||
for encryption of their TLS payload.
|
||||
* ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
|
||||
passed to the device for encryption.
|
||||
* ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
|
||||
encryption.
|
||||
* ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
|
||||
but did not arrive in the expected order
|
||||
* ``tx_tls_drop_no_sync_data`` - number of TX packets dropped because
|
||||
they arrived out of order and associated record could not be found
|
||||
but did not arrive in the expected order.
|
||||
* ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
|
||||
a TLS stream dropped, because they arrived out of order and associated
|
||||
record could not be found.
|
||||
* ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
|
||||
stream dropped, because they contain both data that has been encrypted by
|
||||
software and data that expects hardware crypto offload.
|
||||
|
||||
Notable corner cases, exceptions and additional requirements
|
||||
============================================================
|
||||
|
@@ -753,8 +753,8 @@ in-kernel irqchip (GIC), and for in-kernel irqchip can tell the GIC to
|
||||
use PPIs designated for specific cpus. The irq field is interpreted
|
||||
like this:
|
||||
|
||||
bits: | 31 ... 24 | 23 ... 16 | 15 ... 0 |
|
||||
field: | irq_type | vcpu_index | irq_id |
|
||||
bits: | 31 ... 28 | 27 ... 24 | 23 ... 16 | 15 ... 0 |
|
||||
field: | vcpu2_index | irq_type | vcpu_index | irq_id |
|
||||
|
||||
The irq_type field has the following values:
|
||||
- irq_type[0]: out-of-kernel GIC: irq_id 0 is IRQ, irq_id 1 is FIQ
|
||||
@@ -766,6 +766,14 @@ The irq_type field has the following values:
|
||||
|
||||
In both cases, level is used to assert/deassert the line.
|
||||
|
||||
When KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 is supported, the target vcpu is
|
||||
identified as (256 * vcpu2_index + vcpu_index). Otherwise, vcpu2_index
|
||||
must be zero.
|
||||
|
||||
Note that on arm/arm64, the KVM_CAP_IRQCHIP capability only conditions
|
||||
injection of interrupts for the in-kernel irqchip. KVM_IRQ_LINE can always
|
||||
be used for a userspace interrupt controller.
|
||||
|
||||
struct kvm_irq_level {
|
||||
union {
|
||||
__u32 irq; /* GSI */
|
||||
|
@@ -237,7 +237,7 @@ The usage pattern is::
|
||||
ret = hmm_range_snapshot(&range);
|
||||
if (ret) {
|
||||
up_read(&mm->mmap_sem);
|
||||
if (ret == -EAGAIN) {
|
||||
if (ret == -EBUSY) {
|
||||
/*
|
||||
* No need to check hmm_range_wait_until_valid() return value
|
||||
* on retry we will get proper error with hmm_range_snapshot()
|
||||
|
Reference in New Issue
Block a user