Pull x86 apic updates from Ingo Molnar:
"Improve the spreading of managed IRQs at allocation time"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irq/matrix: Spread managed interrupts on allocation
irq/matrix: Split out the CPU selection code into a helper
Pull RAS updates from Ingo Molnar:
"Misc smaller fixes and cleanups"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mcelog: Remove one mce_helper definition
x86/mce: Add macros for the corrected error count bit field
x86/mce: Use BIT_ULL(x) for bit mask definitions
x86/mce-inject: Reset injection struct after injection
Pull perf updates from Ingo Molnar:
"The main updates in this cycle were:
- Lots of perf tooling changes too voluminous to list (big perf trace
and perf stat improvements, lots of libtraceevent reorganization,
etc.), so I'll list the authors and refer to the changelog for
details:
Benjamin Peterson, Jérémie Galarneau, Kim Phillips, Peter
Zijlstra, Ravi Bangoria, Sangwon Hong, Sean V Kelley, Steven
Rostedt, Thomas Gleixner, Ding Xiang, Eduardo Habkost, Thomas
Richter, Andi Kleen, Sanskriti Sharma, Adrian Hunter, Tzvetomir
Stoyanov, Arnaldo Carvalho de Melo, Jiri Olsa.
... with the bulk of the changes written by Jiri Olsa, Tzvetomir
Stoyanov and Arnaldo Carvalho de Melo.
- Continued intel_rdt work with a focus on playing well with perf
events. This also imported some non-perf RDT work due to
dependencies. (Reinette Chatre)
- Implement counter freezing for Arch Perfmon v4 (Skylake and newer).
This allows to speed up the PMI handler by avoiding unnecessary MSR
writes and make it more accurate. (Andi Kleen)
- kprobes cleanups and simplification (Masami Hiramatsu)
- Intel Goldmont PMU updates (Kan Liang)
- ... plus misc other fixes and updates"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (155 commits)
kprobes/x86: Use preempt_enable() in optimized_callback()
x86/intel_rdt: Prevent pseudo-locking from using stale pointers
kprobes, x86/ptrace.h: Make regs_get_kernel_stack_nth() not fault on bad stack
perf/x86/intel: Export mem events only if there's PEBS support
x86/cpu: Drop pointless static qualifier in punit_dev_state_show()
x86/intel_rdt: Fix initial allocation to consider CDP
x86/intel_rdt: CBM overlap should also check for overlap with CDP peer
x86/intel_rdt: Introduce utility to obtain CDP peer
tools lib traceevent, perf tools: Move struct tep_handler definition in a local header file
tools lib traceevent: Separate out tep_strerror() for strerror_r() issues
perf python: More portable way to make CFLAGS work with clang
perf python: Make clang_has_option() work on Python 3
perf tools: Free temporary 'sys' string in read_event_files()
perf tools: Avoid double free in read_event_file()
perf tools: Free 'printk' string in parse_ftrace_printk()
perf tools: Cleanup trace-event-info 'tdata' leak
perf strbuf: Match va_{add,copy} with va_end
perf test: S390 does not support watchpoints in test 22
perf auxtrace: Include missing asm/bitsperlong.h to get BITS_PER_LONG
tools include: Adopt linux/bits.h
...
Pull locking and misc x86 updates from Ingo Molnar:
"Lots of changes in this cycle - in part because locking/core attracted
a number of related x86 low level work which was easier to handle in a
single tree:
- Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E.
McKenney, Andrea Parri)
- lockdep scalability improvements and micro-optimizations (Waiman
Long)
- rwsem improvements (Waiman Long)
- spinlock micro-optimization (Matthew Wilcox)
- qspinlocks: Provide a liveness guarantee (more fairness) on x86.
(Peter Zijlstra)
- Add support for relative references in jump tables on arm64, x86
and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens)
- Be a lot less permissive on weird (kernel address) uaccess faults
on x86: BUG() when uaccess helpers fault on kernel addresses (Jann
Horn)
- macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav
Amit)
- ... and a handful of other smaller changes as well"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
locking/lockdep: Make global debug_locks* variables read-mostly
locking/lockdep: Fix debug_locks off performance problem
locking/pvqspinlock: Extend node size when pvqspinlock is configured
locking/qspinlock_stat: Count instances of nested lock slowpaths
locking/qspinlock, x86: Provide liveness guarantee
x86/asm: 'Simplify' GEN_*_RMWcc() macros
locking/qspinlock: Rework some comments
locking/qspinlock: Re-order code
locking/lockdep: Remove duplicated 'lock_class_ops' percpu array
x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y
futex: Replace spin_is_locked() with lockdep
locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y
x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops
x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs
x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs
x86/refcount: Work around GCC inlining bug
x86/objtool: Use asm macros to work around GCC inlining bugs
...
Pull power management updates from Rafael Wysocki:
"These make hibernation on 32-bit x86 systems work in all of the cases
in which it works on 64-bit x86 ones, update the menu cpuidle governor
and the "polling" state to make them more efficient, add more hardware
support to cpufreq drivers and fix issues with some of them, fix a bug
in the conservative cpufreq governor, fix the operating performance
points (OPP) framework and make it more stable, update the devfreq
subsystem to take changes in the APIs used by into account and clean
up some things all over.
Specifics:
- Backport hibernation bug fixes from x86-64 to x86-32 and
consolidate hibernation handling on x86 to allow 32-bit systems to
work in all of the cases in which 64-bit ones work (Zhimin Gu, Chen
Yu).
- Fix hibernation documentation (Vladimir D. Seleznev).
- Update the menu cpuidle governor to fix a couple of issues with it,
make it more efficient in some cases and clean it up (Rafael
Wysocki).
- Rework the cpuidle polling state implementation to make it more
efficient (Rafael Wysocki).
- Clean up the cpuidle core somewhat (Fieah Lim).
- Fix the cpufreq conservative governor to take policy limits into
account properly in some cases (Rafael Wysocki).
- Add support for retrieving guaranteed performance information to
the ACPI CPPC library and make the intel_pstate driver use it to
expose the CPU base frequency via sysfs on systems with the
hardware-managed P-states (HWP) feature enabled (Srinivas
Pandruvada).
- Fix clang warning in the CPPC cpufreq driver (Nathan Chancellor).
- Get rid of device_node.name printing from cpufreq (Rob Herring).
- Remove unnecessary unlikely() from the cpufreq core (Igor Stoppa).
- Add support for the r8a7744 SoC to the cpufreq-dt driver (Biju
Das).
- Update the dt-platdev cpufreq driver to allow RK3399 to have
separate tunables per cluster (Dmitry Torokhov).
- Fix the dma_alloc_coherent() usage in the tegra186 cpufreq driver
(Christoph Hellwig).
- Make the imx6q cpufreq driver read OCOTP through nvmem for
imx6ul/imx6ull (Anson Huang).
- Fix several bugs in the operating performance points (OPP)
framework and make it more stable (Viresh Kumar, Dave Gerlach).
- Update the devfreq subsystem to take changes in the APIs used by
into account, fix some issues with it and make it stop print
device_node.name directly (Bjorn Andersson, Enric Balletbo i Serra,
Matthias Kaehlcke, Rob Herring, Vincent Donnefort, zhong jiang).
- Prepare the generic power domains (genpd) framework for dealing
with domains containing CPUs (Ulf Hansson).
- Prevent sysfs attributes representing low-power S0 residency
counters from being exposed if low-power S0 support is not
indicated in ACPI FADT (Rajneesh Bhardwaj).
- Get rid of custom CPU features macros for Intel CPUs from the
intel_idle and RAPL drivers (Andy Shevchenko).
- Update the tasks freezer to list tasks that refused to freeze and
caused a system transition to a sleep state to be aborted (Todd
Brandt).
- Update the pm-graph set of tools to v5.2 (Todd Brandt).
- Fix some issues in the cpupower utility (Anders Roxell, Prarit
Bhargava)"
* tag 'pm-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (73 commits)
PM / Domains: Document flags for genpd
PM / Domains: Deal with multiple states but no governor in genpd
PM / Domains: Don't treat zero found compatible idle states as an error
cpuidle: menu: Avoid computations when result will be discarded
cpuidle: menu: Drop redundant comparison
cpufreq: tegra186: don't pass GFP_DMA32 to dma_alloc_coherent()
cpufreq: conservative: Take limits changes into account properly
Documentation: intel_pstate: Add base_frequency information
cpufreq: intel_pstate: Add base_frequency attribute
ACPI / CPPC: Add support for guaranteed performance
cpuidle: menu: Simplify checks related to the polling state
PM / tools: sleepgraph and bootgraph: upgrade to v5.2
PM / tools: sleepgraph: first batch of v5.2 changes
cpupower: Fix coredump on VMWare
cpupower: Fix AMD Family 0x17 msr_pstate size
cpufreq: imx6q: read OCOTP through nvmem for imx6ul/imx6ull
cpufreq: dt-platdev: allow RK3399 to have separate tunables per cluster
cpuidle: poll_state: Revise loop termination condition
cpuidle: menu: Move the latency_req == 0 special case check
cpuidle: menu: Avoid computations for very close timers
...
Pull dma mapping updates from Christoph Hellwig:
"First batch of dma-mapping changes for 4.20.
There will be a second PR as some big changes were only applied just
before the end of the merge window, and I want to give them a few more
days in linux-next.
Summary:
- mostly more consolidation of the direct mapping code, including
converting over hexagon, and merging the coherent and non-coherent
code into a single dma_map_ops instance (me)
- cleanups for the dma_configure/dma_unconfigure callchains (me)
- better handling of dma_masks in odd setups (me, Alexander Duyck)
- better debugging of passing vmalloc address to the DMA API (Stephen
Boyd)
- CMA command line parsing fix (He Zhe)"
* tag 'dma-mapping-4.20' of git://git.infradead.org/users/hch/dma-mapping: (27 commits)
dma-direct: respect DMA_ATTR_NO_WARN
dma-mapping: translate __GFP_NOFAIL to DMA_ATTR_NO_WARN
dma-direct: document the zone selection logic
dma-debug: Check for drivers mapping invalid addresses in dma_map_single()
dma-direct: fix return value of dma_direct_supported
dma-mapping: move dma_default_get_required_mask under ifdef
dma-direct: always allow dma mask <= physiscal memory size
dma-direct: implement complete bus_dma_mask handling
dma-direct: refine dma_direct_alloc zone selection
dma-direct: add an explicit dma_direct_get_required_mask
dma-mapping: make the get_required_mask method available unconditionally
unicore32: remove swiotlb support
Revert "dma-mapping: clear dev->dma_ops in arch_teardown_dma_ops"
dma-mapping: support non-coherent devices in dma_common_get_sgtable
dma-mapping: consolidate the dma mmap implementations
dma-mapping: merge direct and noncoherent ops
dma-mapping: move the dma_coherent flag to struct device
MIPS: don't select DMA_MAYBE_COHERENT from DMA_PERDEV_COHERENT
dma-mapping: add the missing ARCH_HAS_SYNC_DMA_FOR_CPU_ALL declaration
dma-mapping: fix panic caused by passing empty cma command line argument
...
When the last CPU in an rdt_domain goes offline, its rdt_domain struct gets
freed. Current pseudo-locking code is unaware of this scenario and tries to
dereference the freed structure in a few places.
Add checks to prevent pseudo-locking code from doing this.
While further work is needed to seamlessly restore resource groups (not
just pseudo-locking) to their configuration when the domain is brought back
online, the immediate issue of invalid pointers is addressed here.
Fixes: f4e80d67a5 ("x86/intel_rdt: Resctrl files reflect pseudo-locked information")
Fixes: 443810fe61 ("x86/intel_rdt: Create debugfs files for pseudo-locking testing")
Fixes: 746e08590b ("x86/intel_rdt: Create character device exposing pseudo-locked region")
Fixes: 33dc3e410a ("x86/intel_rdt: Make CPU information accessible for pseudo-locked regions")
Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: gavin.hindman@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/231f742dbb7b00a31cc104416860e27dba6b072d.1539384145.git.reinette.chatre@intel.com
* acpi-pm:
ACPI / PM: LPIT: Register sysfs attributes based on FADT
* pm-sleep:
x86-32, hibernate: Adjust in_suspend after resumed on 32bit system
x86-32, hibernate: Set up temporary text mapping for 32bit system
x86-32, hibernate: Switch to relocated restore code during resume on 32bit system
x86-32, hibernate: Switch to original page table after resumed
x86-32, hibernate: Use the page size macro instead of constant value
x86-32, hibernate: Use temp_pgt as the temporary page table
x86, hibernate: Rename temp_level4_pgt to temp_pgt
x86-32, hibernate: Enable CONFIG_ARCH_HIBERNATION_HEADER on 32bit system
x86, hibernate: Extract the common code of 64/32 bit system
x86-32/asm/power: Create stack frames in hibernate_asm_32.S
PM / hibernate: Check the success of generating md5 digest before hibernation
x86, hibernate: Fix nosave_regions setup for hibernation
PM / sleep: Show freezing tasks that caused a suspend abort
PM / hibernate: Documentation: fix image_size default value
Clang warns that the declaration of jiffies in include/linux/jiffies.h
doesn't match the definition in arch/x86/time/kernel.c:
arch/x86/kernel/time.c:29:42: warning: section does not match previous declaration [-Wsection]
__visible volatile unsigned long jiffies __cacheline_aligned = INITIAL_JIFFIES;
^
./include/linux/cache.h:49:4: note: expanded from macro '__cacheline_aligned'
__section__(".data..cacheline_aligned")))
^
./include/linux/jiffies.h:81:31: note: previous attribute is here
extern unsigned long volatile __cacheline_aligned_in_smp __jiffy_arch_data jiffies;
^
./arch/x86/include/asm/cache.h:20:2: note: expanded from macro '__cacheline_aligned_in_smp'
__page_aligned_data
^
./include/linux/linkage.h:39:29: note: expanded from macro '__page_aligned_data'
#define __page_aligned_data __section(.data..page_aligned) __aligned(PAGE_SIZE)
^
./include/linux/compiler_attributes.h:233:56: note: expanded from macro '__section'
#define __section(S) __attribute__((__section__(#S)))
^
1 warning generated.
The declaration was changed in commit 7c30f352c8 ("jiffies.h: declare
jiffies and jiffies_64 with ____cacheline_aligned_in_smp") but wasn't
updated here. Make them match so Clang no longer warns.
Fixes: 7c30f352c8 ("jiffies.h: declare jiffies and jiffies_64 with ____cacheline_aligned_in_smp")
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181013005311.28617-1-natechancellor@gmail.com
Xen PVH guests receive the address of the RSDP table from Xen. In order
to support booting a Xen PVH guest via Grub2 using the standard x86
boot entry we need a way for Grub2 to pass the RSDP address to the
kernel.
For this purpose expand the struct setup_header to hold the physical
address of the RSDP address. Being zero means it isn't specified and
has to be located the legacy way (searching through low memory or
EBDA).
While documenting the new setup_header layout and protocol version
2.14 add the missing documentation of protocol version 2.13.
There are Grub2 versions in several distros with a downstream patch
violating the boot protocol by writing past the end of setup_header.
This requires another update of the boot protocol to enable the kernel
to distinguish between a specified RSDP address and one filled with
garbage by such a broken Grub2.
From protocol 2.14 on Grub2 will write the version it is supporting
(but never a higher value than found to be supported by the kernel)
ored with 0x8000 to the version field of setup_header. This enables
the kernel to know up to which field Grub2 has written information
to. All fields after that are supposed to be clobbered.
Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boris.ostrovsky@oracle.com
Cc: bp@alien8.de
Cc: corbet@lwn.net
Cc: linux-doc@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20181010061456.22238-3-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a new resource group is created it is initialized with a default
allocation that considers which portions of cache are currently
available for sharing across all resource groups or which portions of
cache are currently unused.
If a CDP allocation forms part of a resource group that is in exclusive
mode then it should be ensured that no new allocation overlaps with any
resource that shares the underlying hardware. The current initial
allocation does not take this sharing of hardware into account and
a new allocation in a resource that shares the same
hardware would affect the exclusive resource group.
Fix this by considering the allocation of a peer RDT domain - a RDT
domain sharing the same hardware - as part of the test to determine
which portion of cache is in use and available for use.
Fixes: 95f0b77efa ("x86/intel_rdt: Initialize new resource group with sane defaults")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: tony.luck@intel.com
Cc: jithu.joseph@intel.com
Cc: gavin.hindman@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/b1f7ec08b1695be067de416a4128466d49684317.1538603665.git.reinette.chatre@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The CBM overlap test is used to manage the allocations of RDT resources
where overlap is possible between resource groups. When a resource group
is in exclusive mode then there should be no overlap between resource
groups.
The current overlap test only considers overlap between the same
resources, for example, that usage of a RDT_RESOURCE_L2DATA resource
in one resource group does not overlap with usage of a RDT_RESOURCE_L2DATA
resource in another resource group. The problem with this is that it
allows overlap between a RDT_RESOURCE_L2DATA resource in one resource
group with a RDT_RESOURCE_L2CODE resource in another resource group -
even if both resource groups are in exclusive mode. This is a problem
because even though these appear to be different resources they end up
sharing the same underlying hardware and thus does not fulfill the
user's request for exclusive use of hardware resources.
Fix this by including the CDP peer (if there is one) in every CBM
overlap test. This does not impact the overlap between resources
within the same exclusive resource group that is allowed.
Fixes: 49f7b4efa1 ("x86/intel_rdt: Enable setting of exclusive mode")
Reported-by: Jithu Joseph <jithu.joseph@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jithu Joseph <jithu.joseph@intel.com>
Acked-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: tony.luck@intel.com
Cc: gavin.hindman@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/e538b7f56f7ca15963dce2e00ac3be8edb8a68e1.1538603665.git.reinette.chatre@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce a utility that, when provided with a RDT resource and an
instance of this RDT resource (a RDT domain), would return pointers to
the RDT resource and RDT domain that share the same hardware. This is
specific to the CDP resources that share the same hardware.
For example, if a pointer to the RDT_RESOURCE_L2DATA resource (struct
rdt_resource) and a pointer to an instance of this resource (struct
rdt_domain) is provided, then it will return a pointer to the
RDT_RESOURCE_L2CODE resource as well as the specific instance that
shares the same hardware as the provided rdt_domain.
This utility is created in support of the "exclusive" resource group
mode where overlap of resource allocation between resource groups need
to be avoided. The overlap test need to consider not just the matching
resources, but also the resources that share the same hardware.
Temporarily mark it as unused in support of patch testing to avoid
compile warnings until it is used.
Fixes: 49f7b4efa1 ("x86/intel_rdt: Enable setting of exclusive mode")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jithu Joseph <jithu.joseph@intel.com>
Acked-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: tony.luck@intel.com
Cc: gavin.hindman@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/9b4bc4d59ba2e903b6a3eb17e16ef41a8e7b7c3e.1538603665.git.reinette.chatre@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While the DOC at the beginning of lib/bitmap.c explicitly states that
"The number of valid bits in a given bitmap does _not_ need to be an
exact multiple of BITS_PER_LONG.", some of the bitmap operations do
indeed access BITS_PER_LONG portions of the provided bitmap no matter
the size of the provided bitmap. For example, if bitmap_intersects()
is provided with an 8 bit bitmap the operation will access
BITS_PER_LONG bits from the provided bitmap. While the operation
ensures that these extra bits do not affect the result, the memory
is still accessed.
The capacity bitmasks (CBMs) are typically stored in u32 since they
can never exceed 32 bits. A few instances exist where a bitmap_*
operation is performed on a CBM by simply pointing the bitmap operation
to the stored u32 value.
The consequence of this pattern is that some bitmap_* operations will
access out-of-bounds memory when interacting with the provided CBM. This
is confirmed with a KASAN test that reports:
BUG: KASAN: stack-out-of-bounds in __bitmap_intersects+0xa2/0x100
and
BUG: KASAN: stack-out-of-bounds in __bitmap_weight+0x58/0x90
Fix this by moving any CBM provided to a bitmap operation needing
BITS_PER_LONG to an 'unsigned long' variable.
[ tglx: Changed related function arguments to unsigned long and got rid
of the _cbm extra step ]
Fixes: 72d5050566 ("x86/intel_rdt: Add utilities to test pseudo-locked region possibility")
Fixes: 49f7b4efa1 ("x86/intel_rdt: Enable setting of exclusive mode")
Fixes: d9b48c86eb ("x86/intel_rdt: Display resource groups' allocations' size in bytes")
Fixes: 95f0b77efa ("x86/intel_rdt: Initialize new resource group with sane defaults")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/69a428613a53f10e80594679ac726246020ff94f.1538686926.git.reinette.chatre@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have a special segment descriptor entry in the GDT, whose sole purpose is to
encode the CPU and node numbers in its limit (size) field. There are user-space
instructions that allow the reading of the limit field, which gives us a really
fast way to read the CPU and node IDs from the vDSO for example.
But the naming of related functionality does not make this clear, at all:
VDSO_CPU_SIZE
VDSO_CPU_MASK
__CPU_NUMBER_SEG
GDT_ENTRY_CPU_NUMBER
vdso_encode_cpu_node
vdso_read_cpu_node
There's a number of problems:
- The 'VDSO_CPU_SIZE' doesn't really make it clear that these are number
of bits, nor does it make it clear which 'CPU' this refers to, i.e.
that this is about a GDT entry whose limit encodes the CPU and node number.
- Furthermore, the 'CPU_NUMBER' naming is actively misleading as well,
because the segment limit encodes not just the CPU number but the
node ID as well ...
So use a better nomenclature all around: name everything related to this trick
as 'CPUNODE', to make it clear that this is something special, and add
_BITS to make it clear that these are number of bits, and propagate this to
every affected name:
VDSO_CPU_SIZE => VDSO_CPUNODE_BITS
VDSO_CPU_MASK => VDSO_CPUNODE_MASK
__CPU_NUMBER_SEG => __CPUNODE_SEG
GDT_ENTRY_CPU_NUMBER => GDT_ENTRY_CPUNODE
vdso_encode_cpu_node => vdso_encode_cpunode
vdso_read_cpu_node => vdso_read_cpunode
This, beyond being less confusing, also makes it easier to grep for all related
functionality:
$ git grep -i cpunode arch/x86
Also, while at it, fix "return is not a function" style sloppiness in vdso_encode_cpunode().
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Markus T Metzger <markus.t.metzger@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1537312139-5580-2-git-send-email-chang.seok.bae@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce FS/GS base access functionality via <asm/fsgsbase.h>,
not yet used by anything directly.
Factor out task_seg_base() from x86/ptrace.c and rename it to
x86_fsgsbase_read_task() to make it part of the new helpers.
This will allow us to enhance FSGSBASE support and eventually enable
the FSBASE/GSBASE instructions.
An "inactive" GS base refers to a base saved at kernel entry
and being part of an inactive, non-running/stopped user-task.
(The typical ptrace model.)
Here are the new functions:
x86_fsbase_read_task()
x86_gsbase_read_task()
x86_fsbase_write_task()
x86_gsbase_write_task()
x86_fsbase_read_cpu()
x86_fsbase_write_cpu()
x86_gsbase_read_cpu_inactive()
x86_gsbase_write_cpu_inactive()
As an advantage of the unified namespace we can now see all FS/GSBASE
API use in the kernel via the following 'git grep' pattern:
$ git grep x86_.*sbase
[ mingo: Wrote new changelog. ]
Based-on-code-from: Andy Lutomirski <luto@kernel.org>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Markus T Metzger <markus.t.metzger@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537312139-5580-3-git-send-email-chang.seok.bae@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As described in:
77b0bf55bc: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.
The workaround is to set an assembly macro and call it from the inline
assembly block. As a result GCC considers the inline assembly block as
a single instruction. (Which it isn't, but that's the best we can get.)
In this patch we wrap the paravirt call section tricks in a macro,
to hide it from GCC.
The effect of the patch is a more aggressive inlining, which also
causes a size increase of kernel.
text data bss dec hex filename
18147336 10226688 2957312 31331336 1de1408 ./vmlinux before
18162555 10226288 2957312 31346155 1de4deb ./vmlinux after (+14819)
The number of static text symbols (non-inlined functions) goes down:
Before: 40053
After: 39942 (-111)
[ mingo: Rewrote the changelog. ]
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20181003213100.189959-8-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As described in:
77b0bf55bc: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.
The workaround is to set an assembly macro and call it from the inline
assembly block. As a result GCC considers the inline assembly block as
a single instruction. (Which it isn't, but that's the best we can get.)
This patch increases the kernel size:
text data bss dec hex filename
18146889 10225380 2957312 31329581 1de0d2d ./vmlinux before
18147336 10226688 2957312 31331336 1de1408 ./vmlinux after (+1755)
But enables more aggressive inlining (and probably better branch decisions).
The number of static text symbols in vmlinux is much lower:
Before: 40218
After: 40053 (-165)
The assembly code gets harder to read due to the extra macro layer.
[ mingo: Rewrote the changelog. ]
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181003213100.189959-7-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As described in:
77b0bf55bc: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.
The workaround is to set an assembly macro and call it from the inline
assembly block - i.e. to macrify the affected block.
As a result GCC considers the inline assembly block as a single instruction.
This patch handles the LOCK prefix, allowing more aggresive inlining:
text data bss dec hex filename
18140140 10225284 2957312 31322736 1ddf270 ./vmlinux before
18146889 10225380 2957312 31329581 1de0d2d ./vmlinux after (+6845)
This is the reduction in non-inlined functions:
Before: 40286
After: 40218 (-68)
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181003213100.189959-6-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As described in:
77b0bf55bc: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.
The workaround is to set an assembly macro and call it from the inline
assembly block. As a result GCC considers the inline assembly block as
a single instruction. (Which it isn't, but that's the best we can get.)
This patch allows GCC to inline simple functions such as __get_seccomp_filter().
To no-one's surprise the result is that GCC performs more aggressive (read: correct)
inlining decisions in these senarios, which reduces the kernel size and presumably
also speeds it up:
text data bss dec hex filename
18140970 10225412 2957312 31323694 1ddf62e ./vmlinux before
18140140 10225284 2957312 31322736 1ddf270 ./vmlinux after (-958)
16 fewer static text symbols:
Before: 40302
After: 40286 (-16)
these got inlined instead.
Functions such as kref_get(), free_user(), fuse_file_get() now get inlined. Hurray!
[ mingo: Rewrote the changelog. ]
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181003213100.189959-5-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As described in:
77b0bf55bc: ("kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs")
GCC's inlining heuristics are broken with common asm() patterns used in
kernel code, resulting in the effective disabling of inlining.
In the case of objtool the resulting borkage can be significant, since all the
annotations of objtool are discarded during linkage and never inlined,
yet GCC bogusly considers most functions affected by objtool annotations
as 'too large'.
The workaround is to set an assembly macro and call it from the inline
assembly block. As a result GCC considers the inline assembly block as
a single instruction. (Which it isn't, but that's the best we can get.)
This increases the kernel size slightly:
text data bss dec hex filename
18140829 10224724 2957312 31322865 1ddf2f1 ./vmlinux before
18140970 10225412 2957312 31323694 1ddf62e ./vmlinux after (+829)
The number of static text symbols (i.e. non-inlined functions) is reduced:
Before: 40321
After: 40302 (-19)
[ mingo: Rewrote the changelog. ]
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christopher Li <sparse@chrisli.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-sparse@vger.kernel.org
Link: http://lkml.kernel.org/r/20181003213100.189959-4-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Using macros in inline assembly allows us to work around bugs
in GCC's inlining decisions.
Compile macros.S and use it to assemble all C files.
Currently only x86 will use it.
Background:
The inlining pass of GCC doesn't include an assembler, so it's not aware
of basic properties of the generated code, such as its size in bytes,
or that there are such things as discontiuous blocks of code and data
due to the newfangled linker feature called 'sections' ...
Instead GCC uses a lazy and fragile heuristic: it does a linear count of
certain syntactic and whitespace elements in inlined assembly block source
code, such as a count of new-lines and semicolons (!), as a poor substitute
for "code size and complexity".
Unsurprisingly this heuristic falls over and breaks its neck whith certain
common types of kernel code that use inline assembly, such as the frequent
practice of putting useful information into alternative sections.
As a result of this fresh, 20+ years old GCC bug, GCC's inlining decisions
are effectively disabled for inlined functions that make use of such asm()
blocks, because GCC thinks those sections of code are "large" - when in
reality they are often result in just a very low number of machine
instructions.
This absolute lack of inlining provess when GCC comes across such asm()
blocks both increases generated kernel code size and causes performance
overhead, which is particularly noticeable on paravirt kernels, which make
frequent use of these inlining facilities in attempt to stay out of the
way when running on baremetal hardware.
Instead of fixing the compiler we use a workaround: we set an assembly macro
and call it from the inlined assembly block. As a result GCC considers the
inline assembly block as a single instruction. (Which it often isn't but I digress.)
This uglifies and bloats the source code - for example just the refcount
related changes have this impact:
Makefile | 9 +++++++--
arch/x86/Makefile | 7 +++++++
arch/x86/kernel/macros.S | 7 +++++++
scripts/Kbuild.include | 4 +++-
scripts/mod/Makefile | 2 ++
5 files changed, 26 insertions(+), 3 deletions(-)
Yay readability and maintainability, it's not like assembly code is hard to read
and maintain ...
We also hope that GCC will eventually get fixed, but we are not holding
our breath for that. Yet we are optimistic, it might still happen, any decade now.
[ mingo: Wrote new changelog describing the background. ]
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kbuild@vger.kernel.org
Link: http://lkml.kernel.org/r/20181003213100.189959-3-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In resctrl filesystem, mount options exist to enable L3/L2 CDP and MBA
Software Controller features if the platform supports them:
mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl
But currently only "cdp" option is displayed in /proc/mounts. "cdpl2" and
"mba_MBps" options are not shown even when they are active.
Before:
# mount -t resctrl resctrl -o cdp,mba_MBps /sys/fs/resctrl
# grep resctrl /proc/mounts
/sys/fs/resctrl /sys/fs/resctrl resctrl rw,relatime,cdp 0 0
After:
# mount -t resctrl resctrl -o cdp,mba_MBps /sys/fs/resctrl
# grep resctrl /proc/mounts
/sys/fs/resctrl /sys/fs/resctrl resctrl rw,relatime,cdp,mba_MBps 0 0
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1536796118-60135-1-git-send-email-fenghua.yu@intel.com