Commit Graph

129978 Commits

Author SHA1 Message Date
John David Anglin
c0452fb9fb parisc: Fix race in pci-dma.c
We are still troubled by occasional random segmentation faults and
memory memory corruption on SMP machines.  The causes quite a few
package builds to fail on the Debian buildd machines for parisc.  When
gcc-6 failed to build three times in a row, I looked again at the TLB
related code.  I found a couple of issues.  This is the first.

In general, we need to ensure page table updates and corresponding TLB
purges are atomic.  The attached patch fixes an instance in pci-dma.c
where the page table update was not guarded by the TLB lock.

Tested on rp3440 and c8000.  So far, no further random segmentation
faults have been observed.

Signed-off-by: John David Anglin  <dave.anglin@bell.net>
Cc: <stable@vger.kernel.org> # v3.16+
Signed-off-by: Helge Deller <deller@gmx.de>
2016-11-25 12:31:59 +01:00
Helge Deller
43b1f6abd5 parisc: Switch to generic sched_clock implementation
Drop the open-coded sched_clock() function and replace it by the provided
GENERIC_SCHED_CLOCK implementation.  We have seen quite some hung tasks in the
past, which seem to be fixed by this patch.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org> # v4.7+
Signed-off-by: Helge Deller <deller@gmx.de>
2016-11-25 12:31:58 +01:00
John David Anglin
741dc7bf1c parisc: Fix races in parisc_setup_cache_timing()
Helge reported to me the following startup crash:

[    0.000000] Linux version 4.8.0-1-parisc64-smp (debian-kernel@lists.debian.org) (gcc version 5.4.1 20161019 (GCC) ) #1 SMP Debian 4.8.7-1 (2016-11-13)
[    0.000000] The 64-bit Kernel has started...
[    0.000000] Kernel default page size is 4 KB. Huge pages enabled with 1 MB physical and 2 MB virtual size.
[    0.000000] Determining PDC firmware type: System Map.
[    0.000000] model 9000/785/J5000
[    0.000000] Total Memory: 2048 MB
[    0.000000] Memory: 2018528K/2097152K available (9272K kernel code, 3053K rwdata, 1319K rodata, 1024K init, 840K bss, 78624K reserved, 0K cma-reserved)
[    0.000000] virtual kernel memory layout:
[    0.000000]     vmalloc : 0x0000000000008000 - 0x000000003f000000   (1007 MB)
[    0.000000]     memory  : 0x0000000040000000 - 0x00000000c0000000   (2048 MB)
[    0.000000]       .init : 0x0000000040100000 - 0x0000000040200000   (1024 kB)
[    0.000000]       .data : 0x0000000040b0e000 - 0x0000000040f533e0   (4372 kB)
[    0.000000]       .text : 0x0000000040200000 - 0x0000000040b0e000   (9272 kB)
[    0.768910] Brought up 1 CPUs
[    0.992465] NET: Registered protocol family 16
[    2.429981] Releasing cpu 1 now, hpa=fffffffffffa2000
[    2.635751] CPU(s): 2 out of 2 PA8500 (PCX-W) at 440.000000 MHz online
[    2.726692] Setting cache flush threshold to 1024 kB
[    2.729932] Not-handled unaligned insn 0x43ffff80
[    2.798114] Setting TLB flush threshold to 140 kB
[    2.928039] Unaligned handler failed, ret = -1
[    3.000419]       _______________________________
[    3.000419]      < Your System ate a SPARC! Gah! >
[    3.000419]       -------------------------------
[    3.000419]              \   ^__^
[    3.000419]                  (__)\       )\/\
[    3.000419]                   U  ||----w |
[    3.000419]                      ||     ||
[    9.340055] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-1-parisc64-smp #1 Debian 4.8.7-1
[    9.448082] task: 00000000bfd48060 task.stack: 00000000bfd50000
[    9.528040]
[   10.760029] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004025d154 000000004025d158
[   10.868052]  IIR: 43ffff80    ISR: 0000000000340000  IOR: 000001ff54150960
[   10.960029]  CPU:        1   CR30: 00000000bfd50000 CR31: 0000000011111111
[   11.052057]  ORIG_R28: 000000004021e3b4
[   11.100045]  IAOQ[0]: irq_exit+0x94/0x120
[   11.152062]  IAOQ[1]: irq_exit+0x98/0x120
[   11.208031]  RP(r2): irq_exit+0xb8/0x120
[   11.256074] Backtrace:
[   11.288067]  [<00000000402cd944>] cpu_startup_entry+0x1e4/0x598
[   11.368058]  [<0000000040109528>] smp_callin+0x2c0/0x2f0
[   11.436308]  [<00000000402b53fc>] update_curr+0x18c/0x2d0
[   11.508055]  [<00000000402b73b8>] dequeue_entity+0x2c0/0x1030
[   11.584040]  [<00000000402b3cc0>] set_next_entity+0x80/0xd30
[   11.660069]  [<00000000402c1594>] pick_next_task_fair+0x614/0x720
[   11.740085]  [<000000004020dd34>] __schedule+0x394/0xa60
[   11.808054]  [<000000004020e488>] schedule+0x88/0x118
[   11.876039]  [<0000000040283d3c>] rescuer_thread+0x4d4/0x5b0
[   11.948090]  [<000000004028fc4c>] kthread+0x1ec/0x248
[   12.016053]  [<0000000040205020>] end_fault_vector+0x20/0xc0
[   12.092239]  [<00000000402050c0>] _switch_to_ret+0x0/0xf40
[   12.164044]
[   12.184036] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-1-parisc64-smp #1 Debian 4.8.7-1
[   12.244040] Backtrace:
[   12.244040]  [<000000004021c480>] show_stack+0x68/0x80
[   12.244040]  [<00000000406f332c>] dump_stack+0xec/0x168
[   12.244040]  [<000000004021c74c>] die_if_kernel+0x25c/0x430
[   12.244040]  [<000000004022d320>] handle_unaligned+0xb48/0xb50
[   12.244040]
[   12.632066] ---[ end trace 9ca05a7215c7bbb2 ]---
[   12.692036] Kernel panic - not syncing: Attempted to kill the idle task!

We have the insn 0x43ffff80 in IIR but from IAOQ we should have:
   4025d150:   0f f3 20 df     ldd,s r19(r31),r31
   4025d154:   0f 9f 00 9c     ldw r31(ret0),ret0
   4025d158:   bf 80 20 58     cmpb,*<> r0,ret0,4025d18c <irq_exit+0xcc>

Cpu0 has just completed running parisc_setup_cache_timing:

[    2.429981] Releasing cpu 1 now, hpa=fffffffffffa2000
[    2.635751] CPU(s): 2 out of 2 PA8500 (PCX-W) at 440.000000 MHz online
[    2.726692] Setting cache flush threshold to 1024 kB
[    2.729932] Not-handled unaligned insn 0x43ffff80
[    2.798114] Setting TLB flush threshold to 140 kB
[    2.928039] Unaligned handler failed, ret = -1

From the backtrace, cpu1 is in smp_callin:

void __init smp_callin(void)
{
       int slave_id = cpu_now_booting;

       smp_cpu_init(slave_id);
       preempt_disable();

       flush_cache_all_local(); /* start with known state */
       flush_tlb_all_local(NULL);

       local_irq_enable();  /* Interrupts have been off until now */

       cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

So, it has just flushed its caches and the TLB. It would seem either the
flushes in parisc_setup_cache_timing or smp_callin have corrupted kernel
memory.

The attached patch reworks parisc_setup_cache_timing to remove the races
in setting the cache and TLB flush thresholds. It also corrects the
number of bytes flushed in the TLB calculation.

The patch flushes the cache and TLB on cpu0 before starting the
secondary processors so that they are started from a known state.

Tested with a few reboots on c8000.

Signed-off-by: John David Anglin  <dave.anglin@bell.net>
Cc: <stable@vger.kernel.org> # v3.18+
Signed-off-by: Helge Deller <deller@gmx.de>
2016-11-25 12:31:57 +01:00
Matt Redfearn
2a872a5dce MIPS: mm: Fix output of __do_page_fault
Since commit 4bcc595ccd ("printk: reinstate KERN_CONT for printing
continuation lines") the output from __do_page_fault on MIPS has been
pretty unreadable due to the lack of KERN_CONT markers. Use pr_cont
to provide the appropriate markers & restore the expected output.

Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/14544/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2016-11-25 12:08:10 +01:00
Jisheng Zhang
40fdc6b0d2 arm64: dts: berlin4ct-dmp: add missing unit name to /memory node
This patch fixes the following DTC warning with W=1:

"Node /memory has a reg or ranges property, but no unit name"

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
2016-11-25 17:14:00 +08:00
Jisheng Zhang
c71aa0e200 arm64: dts: berlin4ct-stb: add missing unit name to /memory node
This patch fixes the following DTC warning with W=1:

"Node /memory has a reg or ranges property, but no unit name"

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
2016-11-25 17:13:53 +08:00
Jisheng Zhang
47d56462fc arm64: dts: berlin4ct: add missing unit name to /soc node
This patch fixes the following DTC warning with W=1:

"Node /soc has a reg or ranges property, but no unit name"

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
2016-11-25 17:13:44 +08:00
Martin Schwidefsky
61aaef51cc s390: fix kernel oops for CONFIG_MARCH_Z900=y builds
The LAST_BREAK macro in entry.S uses a different instruction sequence
for CONFIG_MARCH_Z900 builds. The branch target offset to skip the
store of the last breaking event address needs to take the different
length of the code block into account.

Fixes: f8fc82b471 ("s390: move sys_call_table and last_break from thread_info to thread_struct")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-11-25 10:07:55 +01:00
Masahiro Yamada
0248994d7c ARM: dts: berlin2q-marvell-dmp: fix typo in chosen node
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
2016-11-25 16:52:18 +08:00
Jisheng Zhang
d289fcc888 ARM: dts: berlin2q-marvell-dmp: fix regulators' name
This patch fixes the following DTC warnings with W=1:

Warning (unit_address_vs_reg): Node /regulators/regulator@0 has a unit
name, but no reg property
Warning (unit_address_vs_reg): Node /regulators/regulator@1 has a unit
name, but no reg property
Warning (unit_address_vs_reg): Node /regulators/regulator@2 has a unit
name, but no reg property
Warning (unit_address_vs_reg): Node /regulators/regulator@3 has a unit
name, but no reg property
Warning (unit_address_vs_reg): Node /regulators/regulator@4 has a unit
name, but no reg property

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
2016-11-25 16:50:20 +08:00
Borislav Petkov
9b032d21f6 x86/boot/64: Use defines for page size
... instead of naked numbers like the rest of the asm does in this file.

No code changed:

  # arch/x86/kernel/head_64.o:

   text    data     bss     dec     hex filename
   1124  290864    4096  296084   48494 head_64.o.before
   1124  290864    4096  296084   48494 head_64.o.after

md5:
   87086e202588939296f66e892414ffe2  head_64.o.before.asm
   87086e202588939296f66e892414ffe2  head_64.o.after.asm

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161124210550.15025-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-25 07:11:29 +01:00
Balbir Singh
1d18ad0268 powerpc/mm: Detect instruction fetch denied and report
ISA 3 allows for prevention of instruction fetch and execution
of user mode pages. If such an error occurs, SRR1 bit 35 reports the
error. We catch and report the error in do_page_fault().

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25 15:01:35 +11:00
Balbir Singh
ee97b6b99f powerpc/mm/radix: Setup AMOR in HV mode to allow key 0
Setup AMOR (Authority Mask Override Register) in HV mode so that the
host and guest kernel can in turn setup IAMR.

This allows us to enable key 0 in a following patch.

Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25 15:01:31 +11:00
Gautham R. Shenoy
378f96d3cd powernv: Clear SPRN_PSSCR when a POWER9 CPU comes online
Ensure that PSSCR is set to a safe value corresponding to no
state-loss each time a POWER9 CPU comes online.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-By: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25 14:37:04 +11:00
Michael Ellerman
56144ec7c9 powerpc/xmon: Add 'dt' command to dump trace buffers
There is a nice interface for asking ftrace to dump all its tracing
buffers. The only down side for use in xmon is that it uses printk.
Depending on circumstances printk may not work when in xmon, but it also
may, so add a 'dt' command which dumps the ftrace buffers, and add a
note to the help to mentiont that it uses printk.

Calling this routine also disables tracing, which is problematic if you
return from xmon and expect the system to keep operating normally. So
after we do the dump turn tracing back on.

Both functions already have nop versions defined for when ftrace is not
enabled, so we don't need any extra #ifdefs.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25 14:30:27 +11:00
Aneesh Kumar K.V
984d7a1ec6 powerpc/mm: Fixup kernel read only mapping
With commit e58e87adc8 ("powerpc/mm: Update _PAGE_KERNEL_RO") we
started using the ppp value 0b110 to map kernel readonly. But that
facility was only added as part of ISA 2.04. For earlier ISA version
only supported ppp bit value for readonly mapping is 0b011. (This
implies both user and kernel get mapped using the same ppp bit value for
readonly mapping.).
Update the code such that for earlier architecture version we use ppp
value 0b011 for readonly mapping. We don't differentiate between power5+
and power5 here and apply the new ppp bits only from power6 (ISA 2.05).
This keep the changes minimal.

This fixes issue with PS3 spu usage reported at
https://lkml.kernel.org/r/rep.1421449714.geoff@infradead.org

Fixes: e58e87adc8 ("powerpc/mm: Update _PAGE_KERNEL_RO")
Cc: stable@vger.kernel.org # v4.7+
Tested-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25 14:18:25 +11:00
Geliang Tang
ebb242d56b powerpc/of_platform: Use builtin_platform_driver
Use builtin_platform_driver() helper to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25 14:07:51 +11:00
Michael Ellerman
da58b23cb9 powerpc: Fix __cmpxchg() to take a volatile ptr again
In commit d0563a1297 ("powerpc: Implement {cmp}xchg for u8 and u16")
we removed the volatile from __cmpxchg().

This is leading to warnings such as:

  drivers/gpu/drm/drm_lock.c: In function ‘drm_lock_take’:
  arch/powerpc/include/asm/cmpxchg.h:484:37: warning: passing argument 1
  of ‘__cmpxchg’ discards ‘volatile’ qualifier from pointer target
     (__typeof__(*(ptr))) __cmpxchg((ptr), (unsigned long)_o_,   \

There doesn't seem to be consensus across architectures whether the
argument is volatile or not, so at least for now put the volatile back.

Fixes: d0563a1297 ("powerpc: Implement {cmp}xchg for u8 and u16")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-25 14:07:50 +11:00
Tim Chen
d3d37d850d x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
Some Intel cores in a package can be boosted to a higher turbo frequency
with ITMT 3.0 technology. The scheduler can use the asymmetric packing
feature to move tasks to the more capable cores.

If ITMT is enabled, add SD_ASYM_PACKING flag to the thread and core
sched domains to enable asymmetric packing.

Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/9bbb885bedbef4eb50e197305eb16b160cff0831.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-24 20:44:20 +01:00
Tim Chen
f9793e3495 x86/sysctl: Add sysctl for ITMT scheduling feature
Intel Turbo Boost Max Technology 3.0 (ITMT) feature
allows some cores to be boosted to higher turbo
frequency than others.

Add /proc/sys/kernel/sched_itmt_enabled so operator
can enable/disable scheduling of tasks that favor cores
with higher turbo boost frequency potential.

By default, system that is ITMT capable and single
socket has this feature turned on.  It is more likely
to be lightly loaded and operates in Turbo range.

When there is a change in the ITMT scheduling operation
desired, a rebuild of the sched domain is initiated
so the scheduler can set up sched domains with appropriate
flag to enable/disable ITMT scheduling operations.

Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/07cc62426a28bad57b01ab16bb903a9c84fa5421.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-24 20:44:19 +01:00
Tim Chen
5e76b2ab36 x86: Enable Intel Turbo Boost Max Technology 3.0
On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum
turbo frequencies of some cores in a CPU package may be higher than for
the other cores in the same package.  In that case, better performance
(and possibly lower energy consumption as well) can be achieved by
making the scheduler prefer to run tasks on the CPUs with higher max
turbo frequencies.

To that end, set up a core priority metric to abstract the core
preferences based on the maximum turbo frequency.  In that metric,
the cores with higher maximum turbo frequencies are higher-priority
than the other cores in the same package and that causes the scheduler
to favor them when making load-balancing decisions using the asymmertic
packing approach.  At the same time, the priority of SMT threads with a
higher CPU number is reduced so as to avoid scheduling tasks on all of
the threads that belong to a favored core before all of the other cores
have been given a task to run.

The priority metric will be initialized by the P-state driver with the
help of the sched_set_itmt_core_prio() function.  The P-state driver
will also determine whether or not ITMT is supported by the platform
and will call sched_set_itmt_support() to indicate that.

Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-24 20:44:19 +01:00
Tim Chen
7d25127cef x86/topology: Define x86's arch_update_cpu_topology
The scheduler calls arch_update_cpu_topology() to check whether the
scheduler domains have to be rebuilt.

So far x86 has no requirement for this, but the upcoming ITMT support
makes this necessary.

Request the rebuild when the x86 internal update flag is set.

Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/bfbf5591276ec60b2af2da798adc1060df1e2a5f.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-24 20:44:19 +01:00
Radim Krčmář
df492896e6 KVM: x86: check for pic and ioapic presence before use
Split irqchip allows pic and ioapic routes to be used without them being
created, which results in NULL access.  Check for NULL and avoid it.
(The setup is too racy for a nicer solutions.)

Found by syzkaller:

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 3 PID: 11923 Comm: kworker/3:2 Not tainted 4.9.0-rc5+ #27
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: events irqfd_inject
  task: ffff88006a06c7c0 task.stack: ffff880068638000
  RIP: 0010:[...]  [...] __lock_acquire+0xb35/0x3380 kernel/locking/lockdep.c:3221
  RSP: 0000:ffff88006863ea20  EFLAGS: 00010006
  RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: 0000000000000000
  RDX: 0000000000000039 RSI: 0000000000000000 RDI: 1ffff1000d0c7d9e
  RBP: ffff88006863ef58 R08: 0000000000000001 R09: 0000000000000000
  R10: 00000000000001c8 R11: 0000000000000000 R12: ffff88006a06c7c0
  R13: 0000000000000001 R14: ffffffff8baab1a0 R15: 0000000000000001
  FS:  0000000000000000(0000) GS:ffff88006d100000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000004abdd0 CR3: 000000003e2f2000 CR4: 00000000000026e0
  Stack:
   ffffffff894d0098 1ffff1000d0c7d56 ffff88006863ecd0 dffffc0000000000
   ffff88006a06c7c0 0000000000000000 ffff88006863ecf8 0000000000000082
   0000000000000000 ffffffff815dd7c1 ffffffff00000000 ffffffff00000000
  Call Trace:
   [...] lock_acquire+0x2a2/0x790 kernel/locking/lockdep.c:3746
   [...] __raw_spin_lock include/linux/spinlock_api_smp.h:144
   [...] _raw_spin_lock+0x38/0x50 kernel/locking/spinlock.c:151
   [...] spin_lock include/linux/spinlock.h:302
   [...] kvm_ioapic_set_irq+0x4c/0x100 arch/x86/kvm/ioapic.c:379
   [...] kvm_set_ioapic_irq+0x8f/0xc0 arch/x86/kvm/irq_comm.c:52
   [...] kvm_set_irq+0x239/0x640 arch/x86/kvm/../../../virt/kvm/irqchip.c:101
   [...] irqfd_inject+0xb4/0x150 arch/x86/kvm/../../../virt/kvm/eventfd.c:60
   [...] process_one_work+0xb40/0x1ba0 kernel/workqueue.c:2096
   [...] worker_thread+0x214/0x18a0 kernel/workqueue.c:2230
   [...] kthread+0x328/0x3e0 kernel/kthread.c:209
   [...] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 49df6397ed ("KVM: x86: Split the APIC from the rest of IRQCHIP.")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:39:28 +01:00
Radim Krčmář
81cdb259fb KVM: x86: fix out-of-bounds accesses of rtc_eoi map
KVM was using arrays of size KVM_MAX_VCPUS with vcpu_id, but ID can be
bigger that the maximal number of VCPUs, resulting in out-of-bounds
access.

Found by syzkaller:

  BUG: KASAN: slab-out-of-bounds in __apic_accept_irq+0xb33/0xb50 at addr [...]
  Write of size 1 by task a.out/27101
  CPU: 1 PID: 27101 Comm: a.out Not tainted 4.9.0-rc5+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __apic_accept_irq+0xb33/0xb50 arch/x86/kvm/lapic.c:905
   [...] kvm_apic_set_irq+0x10e/0x180 arch/x86/kvm/lapic.c:495
   [...] kvm_irq_delivery_to_apic+0x732/0xc10 arch/x86/kvm/irq_comm.c:86
   [...] ioapic_service+0x41d/0x760 arch/x86/kvm/ioapic.c:360
   [...] ioapic_set_irq+0x275/0x6c0 arch/x86/kvm/ioapic.c:222
   [...] kvm_ioapic_inject_all arch/x86/kvm/ioapic.c:235
   [...] kvm_set_ioapic+0x223/0x310 arch/x86/kvm/ioapic.c:670
   [...] kvm_vm_ioctl_set_irqchip arch/x86/kvm/x86.c:3668
   [...] kvm_arch_vm_ioctl+0x1a08/0x23c0 arch/x86/kvm/x86.c:3999
   [...] kvm_vm_ioctl+0x1fa/0x1a70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3099

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: af1bae5497 ("KVM: x86: bump KVM_MAX_VCPU_ID to 1023")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:37:19 +01:00
Radim Krčmář
2117d5398c KVM: x86: drop error recovery in em_jmp_far and em_ret_far
em_jmp_far and em_ret_far assumed that setting IP can only fail in 64
bit mode, but syzkaller proved otherwise (and SDM agrees).
Code segment was restored upon failure, but it was left uninitialized
outside of long mode, which could lead to a leak of host kernel stack.
We could have fixed that by always saving and restoring the CS, but we
take a simpler approach and just break any guest that manages to fail
as the error recovery is error-prone and modern CPUs don't need emulator
for this.

Found by syzkaller:

  WARNING: CPU: 2 PID: 3668 at arch/x86/kvm/emulate.c:2217 em_ret_far+0x428/0x480
  Kernel panic - not syncing: panic_on_warn set ...

  CPU: 2 PID: 3668 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __dump_stack lib/dump_stack.c:15
   [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
   [...] panic+0x1b7/0x3a3 kernel/panic.c:179
   [...] __warn+0x1c4/0x1e0 kernel/panic.c:542
   [...] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
   [...] em_ret_far+0x428/0x480 arch/x86/kvm/emulate.c:2217
   [...] em_ret_far_imm+0x17/0x70 arch/x86/kvm/emulate.c:2227
   [...] x86_emulate_insn+0x87a/0x3730 arch/x86/kvm/emulate.c:5294
   [...] x86_emulate_instruction+0x520/0x1ba0 arch/x86/kvm/x86.c:5545
   [...] emulate_instruction arch/x86/include/asm/kvm_host.h:1116
   [...] complete_emulated_io arch/x86/kvm/x86.c:6870
   [...] complete_emulated_mmio+0x4e9/0x710 arch/x86/kvm/x86.c:6934
   [...] kvm_arch_vcpu_ioctl_run+0x3b7a/0x5a90 arch/x86/kvm/x86.c:6978
   [...] kvm_vcpu_ioctl+0x61e/0xdd0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2557
   [...] vfs_ioctl fs/ioctl.c:43
   [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
   [...] SYSC_ioctl fs/ioctl.c:694
   [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
   [...] entry_SYSCALL_64_fastpath+0x1f/0xc2

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: d1442d85cc ("KVM: x86: Handle errors when RIP is set during far jumps")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:36:54 +01:00
Radim Krčmář
444fdad88f KVM: x86: fix out-of-bounds access in lapic
Cluster xAPIC delivery incorrectly assumed that dest_id <= 0xff.
With enabled KVM_X2APIC_API_USE_32BIT_IDS in KVM_CAP_X2APIC_API, a
userspace can send an interrupt with dest_id that results in
out-of-bounds access.

Found by syzkaller:

  BUG: KASAN: slab-out-of-bounds in kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 at addr ffff88003d9ca750
  Read of size 8 by task syz-executor/22923
  CPU: 0 PID: 22923 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __dump_stack lib/dump_stack.c:15
   [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
   [...] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
   [...] print_address_description mm/kasan/report.c:194
   [...] kasan_report_error mm/kasan/report.c:283
   [...] kasan_report+0x231/0x500 mm/kasan/report.c:303
   [...] __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:329
   [...] kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 arch/x86/kvm/lapic.c:824
   [...] kvm_irq_delivery_to_apic+0x132/0x9a0 arch/x86/kvm/irq_comm.c:72
   [...] kvm_set_msi+0x111/0x160 arch/x86/kvm/irq_comm.c:157
   [...] kvm_send_userspace_msi+0x201/0x280 arch/x86/kvm/../../../virt/kvm/irqchip.c:74
   [...] kvm_vm_ioctl+0xba5/0x1670 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3015
   [...] vfs_ioctl fs/ioctl.c:43
   [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
   [...] SYSC_ioctl fs/ioctl.c:694
   [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
   [...] entry_SYSCALL_64_fastpath+0x1f/0xc2

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: e45115b62f ("KVM: x86: use physical LAPIC array for logical x2APIC")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:35:53 +01:00
Tom Lendacky
8370c3d08b kvm: svm: Add kvm_fast_pio_in support
Update the I/O interception support to add the kvm_fast_pio_in function
to speed up the in instruction similar to the out instruction.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:32:45 +01:00
Tom Lendacky
147277540b kvm: svm: Add support for additional SVM NPF error codes
AMD hardware adds two additional bits to aid in nested page fault handling.

Bit 32 - NPF occurred while translating the guest's final physical address
Bit 33 - NPF occurred while translating the guest page tables

The guest page tables fault indicator can be used as an aid for nested
virtualization. Using V0 for the host, V1 for the first level guest and
V2 for the second level guest, when both V1 and V2 are using nested paging
there are currently a number of unnecessary instruction emulations. When
V2 is launched shadow paging is used in V1 for the nested tables of V2. As
a result, KVM marks these pages as RO in the host nested page tables. When
V2 exits and we resume V1, these pages are still marked RO.

Every nested walk for a guest page table is treated as a user-level write
access and this causes a lot of NPFs because the V1 page tables are marked
RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
sees a write to a read-only page, emulates the V1 instruction and unprotects
the page (marking it RW). This patch looks for cases where we get a NPF due
to a guest page table walk where the page was marked RO. It immediately
unprotects the page and resumes the guest, leading to far fewer instruction
emulations when nested virtualization is used.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:32:26 +01:00
Paul Burton
1031398035 MIPS: Mask out limit field when calculating wired entry count
Since MIPSr6 the Wired register is split into 2 fields, with the upper
16 bits of the register indicating a limit on the value that the wired
entry count in the bottom 16 bits of the register can take. This means
that simply reading the wired register doesn't get us a valid TLB entry
index any longer, and we instead need to retrieve only the lower 16 bits
of the register. Introduce a new num_wired_entries() function which does
this on MIPSr6 or higher and simply returns the value of the wired
register on older architecture revisions, and make use of it when
reading the number of wired entries.

Since commit e710d66683 ("MIPS: tlb-r4k: If there are wired entries,
don't use TLBINVF") we have been using a non-zero number of wired
entries to determine whether we should avoid use of the tlbinvf
instruction (which would invalidate wired entries) and instead loop over
TLB entries in local_flush_tlb_all(). This loop begins with the number
of wired entries, or before this patch some large bogus TLB index on
MIPSr6 systems. Thus since the aforementioned commit some MIPSr6 systems
with FTLBs have been prone to leaving stale address translations in the
FTLB & crashing in various weird & wonderful ways when we later observe
the wrong memory.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/14557/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2016-11-24 16:44:16 +01:00
Michael Ellerman
ddbefe7e77 Merge branch 'topic/ppc-kvm' into next
Merge the topic branch we're sharing with the kvm-ppc tree.
2016-11-24 22:14:52 +11:00
Oliver O'Halloran
a1ff57416a powerpc/boot: Fix the early OPAL console wrappers
When configured with CONFIG_PPC_EARLY_DEBUG_OPAL=y the kernel expects
the OPAL entry and base addresses to be passed in r8 and r9
respectively. Currently the wrapper does not attempt to restore these
values before entering the decompressed kernel which causes the kernel
to branch into whatever happens to be in r9 when doing a write to the
OPAL console in early boot.

This patch adds a platform_ops hook that can be used to branch into the
new kernel. The OPAL console driver patches this at runtime so that if
the console is used it will be restored just prior to entering the
kernel.

Fixes: 656ad58ef1 ("powerpc/boot: Add OPAL console to epapr wrappers")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-11-24 17:34:01 +11:00
Ritesh Harjani
c987775aa4 arm64: dts: qcom: msm8916: Add ddr support to sdhc1
This adds mmc-ddr-1_8v support to DT for sdhc1 of msm8916.

Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Signed-off-by: Andy Gross <andy.gross@linaro.org>
2016-11-24 00:33:26 -06:00
Ritesh Harjani
a91b2e690d ARM: dts: Add xo to sdhc clock node on qcom platforms
Add xo entry to sdhc clock node on all qcom platforms.

Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Signed-off-by: Andy Gross <andy.gross@linaro.org>
2016-11-24 00:31:02 -06:00
Dan Carpenter
c4597fd756 x86/apic/uv: Silence a shift wrapping warning
'm_io' is stored in 6 bits so it's a number in the 0-63 range.  Static
analysis tools complain that 1 << 63 will wrap so I have changed it to
1ULL << m_io.

This code is over three years old so presumably the bug doesn't happen
very frequently in real life or someone would have complained by now.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Travis <travis@sgi.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Fixes: b15cc4a12b ("x86, uv, uv3: Update x2apic Support for SGI UV3")
Link: http://lkml.kernel.org/r/20161123221908.GA23997@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-24 06:01:05 +01:00
Dmitry Safonov
7b2dd36828 x86/coredump: Always use user_regs_struct for compat_elf_gregset_t
Commit:

  90954e7b94 ("x86/coredump: Use pr_reg size, rather that TIF_IA32 flag")

changed the coredumping code to construct the elf coredump file according
to register set size - and that's good: if binary crashes with 32-bit code
selector, generate 32-bit ELF core, otherwise - 64-bit core.

That was made for restoring 32-bit applications on x86_64: we want
32-bit application after restore to generate 32-bit ELF dump on crash.

All was quite good and recently I started reworking 32-bit applications
dumping part of CRIU: now it has two parasites (32 and 64) for seizing
compat/native tasks, after rework it'll have one parasite, working in
64-bit mode, to which 32-bit prologue long-jumps during infection.

And while it has worked for my work machine, in VM with
!CONFIG_X86_X32_ABI during reworking I faced that segfault in 32-bit
binary, that has long-jumped to 64-bit mode results in dereference
of garbage:

 32-victim[19266]: segfault at f775ef65 ip 00000000f775ef65 sp 00000000f776aa50 error 14
 BUG: unable to handle kernel paging request at ffffffffffffffff
 IP: [<ffffffff81332ce0>] strlen+0x0/0x20
 [...]
 Call Trace:
  [] elf_core_dump+0x11a9/0x1480
  [] do_coredump+0xa6b/0xe60
  [] get_signal+0x1a8/0x5c0
  [] do_signal+0x23/0x660
  [] exit_to_usermode_loop+0x34/0x65
  [] prepare_exit_to_usermode+0x2f/0x40
  [] retint_user+0x8/0x10

That's because we have 64-bit registers set (with according total size)
and we're writing it to elf_thread_core_info which has smaller size
on !CONFIG_X86_X32_ABI. That lead to overwriting ELF notes part.

Tested on 32-, 64-bit ELF crashes and on 32-bit binaries that have
jumped with 64-bit code selector - all is readable with gdb.

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: 90954e7b94 ("x86/coredump: Use pr_reg size, rather that TIF_IA32 flag")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-24 06:01:05 +01:00
David Gibson
a8acaece5d KVM: PPC: Correctly report KVM_CAP_PPC_ALLOC_HTAB
At present KVM on powerpc always reports KVM_CAP_PPC_ALLOC_HTAB as enabled.
However, the ioctl() it advertises (KVM_PPC_ALLOCATE_HTAB) only actually
works on KVM HV.  On KVM PR it will fail with ENOTTY.

QEMU already has a workaround for this, so it's not breaking things in
practice, but it would be better to advertise this correctly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 14:23:05 +11:00
Paul Mackerras
e2702871b4 KVM: PPC: Book3S HV: Fix compilation with unusual configurations
This adds the "again" parameter to the dummy version of
kvmppc_check_passthru(), so that it matches the real version.
This fixes compilation with CONFIG_BOOK3S_64_HV set but
CONFIG_KVM_XICS=n.

This includes asm/smp.h in book3s_hv_builtin.c to fix compilation
with CONFIG_SMP=n.  The explicit inclusion is necessary to provide
definitions of hard_smp_processor_id() and get_hard_smp_processor_id()
in UP configs.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 14:20:00 +11:00
Suraj Jitindar Singh
2ee13be34b KVM: PPC: Book3S HV: Update kvmppc_set_arch_compat() for ISA v3.00
The function kvmppc_set_arch_compat() is used to determine the value of the
processor compatibility register (PCR) for a guest running in a given
compatibility mode. There is currently no support for v3.00 of the ISA.

Add support for v3.00 of the ISA which adds an ISA v2.07 compatilibity mode
to the PCR.

We also add a check to ensure the processor we are running on is capable of
emulating the chosen processor (for example a POWER7 cannot emulate a
POWER8, similarly with a POWER8 and a POWER9).

Based on work by: Paul Mackerras <paulus@ozlabs.org>

[paulus@ozlabs.org - moved dummy PCR_ARCH_300 definition here; set
 guest_pcr_bit when arch_compat == 0, added comment.]

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
45c940ba49 KVM: PPC: Book3S HV: Treat POWER9 CPU threads as independent subcores
With POWER9, each CPU thread has its own MMU context and can be
in the host or a guest independently of the other threads; there is
still however a restriction that all threads must use the same type
of address translation, either radix tree or hashed page table (HPT).

Since we only support HPT guests on a HPT host at this point, we
can treat the threads as being independent, and avoid all of the
work of coordinating the CPU threads.  To make this simpler, we
introduce a new threads_per_vcore() function that returns 1 on
POWER9 and threads_per_subcore on POWER7/8, and use that instead
of threads_per_subcore or threads_per_core in various places.

This also changes the value of the KVM_CAP_PPC_SMT capability on
POWER9 systems from 4 to 1, so that userspace will not try to
create VMs with multiple vcpus per vcore.  (If userspace did create
a VM that thought it was in an SMT mode, the VM might try to use
the msgsndp instruction, which will not work as expected.  In
future it may be possible to trap and emulate msgsndp in order to
allow VMs to think they are in an SMT mode, if only for the purpose
of allowing migration from POWER8 systems.)

With all this, we can now run guests on POWER9 as long as the host
is running with HPT translation.  Since userspace currently has no
way to request radix tree translation for the guest, the guest has
no choice but to use HPT translation.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
84f7139c06 KVM: PPC: Book3S HV: Enable hypervisor virtualization interrupts while in guest
The new XIVE interrupt controller on POWER9 can direct external
interrupts to the hypervisor or the guest.  The interrupts directed to
the hypervisor are controlled by an LPCR bit called LPCR_HVICE, and
come in as a "hypervisor virtualization interrupt".  This sets the
LPCR bit so that hypervisor virtualization interrupts can occur while
we are in the guest.  We then also need to cope with exiting the guest
because of a hypervisor virtualization interrupt.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
bf53c88e42 KVM: PPC: Book3S HV: Use stop instruction rather than nap on POWER9
POWER9 replaces the various power-saving mode instructions on POWER8
(doze, nap, sleep and rvwinkle) with a single "stop" instruction, plus
a register, PSSCR, which controls the depth of the power-saving mode.
This replaces the use of the nap instruction when threads are idle
during guest execution with the stop instruction, and adds code to
set PSSCR to a value which will allow an SMT mode switch while the
thread is idle (given that the core as a whole won't be idle in these
cases).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
f725758b89 KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9
POWER9 includes a new interrupt controller, called XIVE, which is
quite different from the XICS interrupt controller on POWER7 and
POWER8 machines.  KVM-HV accesses the XICS directly in several places
in order to send and clear IPIs and handle interrupts from PCI
devices being passed through to the guest.

In order to make the transition to XIVE easier, OPAL firmware will
include an emulation of XICS on top of XIVE.  Access to the emulated
XICS is via OPAL calls.  The one complication is that the EOI
(end-of-interrupt) function can now return a value indicating that
another interrupt is pending; in this case, the XIVE will not signal
an interrupt in hardware to the CPU, and software is supposed to
acknowledge the new interrupt without waiting for another interrupt
to be delivered in hardware.

This adapts KVM-HV to use the OPAL calls on machines where there is
no XICS hardware.  When there is no XICS, we look for a device-tree
node with "ibm,opal-intc" in its compatible property, which is how
OPAL indicates that it provides XICS emulation.

In order to handle the EOI return value, kvmppc_read_intr() has
become kvmppc_read_one_intr(), with a boolean variable passed by
reference which can be set by the EOI functions to indicate that
another interrupt is pending.  The new kvmppc_read_intr() keeps
calling kvmppc_read_one_intr() until there are no more interrupts
to process.  The return value from kvmppc_read_intr() is the
largest non-zero value of the returns from kvmppc_read_one_intr().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
1704a81cce KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9
On POWER9, the msgsnd instruction is able to send interrupts to
other cores, as well as other threads on the local core.  Since
msgsnd is generally simpler and faster than sending an IPI via the
XICS, we use msgsnd for all IPIs sent by KVM on POWER9.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
7c5b06cadf KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9
POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
and tlbiel (local tlbie) instructions.  Both instructions get a
set of new parameters (RIC, PRS and R) which appear as bits in the
instruction word.  The tlbiel instruction now has a second register
operand, which contains a PID and/or LPID value if needed, and
should otherwise contain 0.

This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
as well as older processors.  Since we only handle HPT guests so
far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
word as on previous processors, so we don't need to conditionally
execute different instructions depending on the processor.

The local flush on first entry to a guest in book3s_hv_rmhandlers.S
is a loop which depends on the number of TLB sets.  Rather than
using feature sections to set the number of iterations based on
which CPU we're on, we now work out this number at VM creation time
and store it in the kvm_arch struct.  That will make it possible to
get the number from the device tree in future, which will help with
compatibility with future processors.

Since mmu_partition_table_set_entry() does a global flush of the
whole LPID, we don't need to do the TLB flush on first entry to the
guest on each processor.  Therefore we don't set all bits in the
tlb_need_flush bitmap on VM startup on POWER9.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
e9cf1e0856 KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs
This adds code to handle two new guest-accessible special-purpose
registers on POWER9: TIDR (thread ID register) and PSSCR (processor
stop status and control register).  They are context-switched
between host and guest, and the guest values can be read and set
via the one_reg interface.

The PSSCR contains some fields which are guest-accessible and some
which are only accessible in hypervisor mode.  We only allow the
guest-accessible fields to be read or set by userspace.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
83677f551e KVM: PPC: Book3S HV: Adjust host/guest context switch for POWER9
Some special-purpose registers that were present and accessible
by guests on POWER8 no longer exist on POWER9, so this adds
feature sections to ensure that we don't try to context-switch
them when going into or out of a guest on POWER9.  These are
all relatively obscure, rarely-used registers, but we had to
context-switch them on POWER8 to avoid creating a covert channel.
They are: SPMC1, SPMC2, MMCRS, CSIGR, TACR, TCSCR, and ACOP.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
7a84084c60 KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9
On POWER9, the SDR1 register (hashed page table base address) is no
longer used, and instead the hardware reads the HPT base address
and size from the partition table.  The partition table entry also
contains the bits that specify the page size for the VRMA mapping,
which were previously in the LPCR.  The VPM0 bit of the LPCR is
now reserved; the processor now always uses the VRMA (virtual
real-mode area) mechanism for guest real-mode accesses in HPT mode,
and the RMO (real-mode offset) mechanism has been dropped.

When entering or exiting the guest, we now only have to set the
LPIDR (logical partition ID register), not the SDR1 register.
There is also no requirement now to transition via a reserved
LPID value.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
abb7c7ddba KVM: PPC: Book3S HV: Adapt to new HPTE format on POWER9
This adapts the KVM-HV hashed page table (HPT) code to read and write
HPT entries in the new format defined in Power ISA v3.00 on POWER9
machines.  The new format moves the B (segment size) field from the
first doubleword to the second, and trims some bits from the AVA
(abbreviated virtual address) and ARPN (abbreviated real page number)
fields.  As far as possible, the conversion is done when reading or
writing the HPT entries, and the rest of the code continues to use
the old format.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
bc33b1fc83 Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the ppc-kvm topic branch to get changes to
arch/powerpc code that are necessary for adding POWER9 KVM support.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:22:28 +11:00
Chris Metcalf
e658a6f14d tile: avoid using clocksource_cyc2ns with absolute cycle count
For large values of "mult" and long uptimes, the intermediate
result of "cycles * mult" can overflow 64 bits.  For example,
the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
we have mult = 853, and after 208.5 days, we overflow 64 bits.

Since clocksource_cyc2ns() is intended to be used for relative
cycle counts, not absolute cycle counts, performance is more
importance than accepting a wider range of cycle values.  So,
just use mult_frac() directly in tile's sched_clock().

Commit 4cecf6d401 ("sched, x86: Avoid unnecessary overflow
in sched_clock") by Salman Qazi results in essentially the same
generated code for x86 as this change does for tile.  In fact,
a follow-on change by Salman introduced mult_frac() and switched
to using it, so the C code was largely identical at that point too.

Peter Zijlstra then added mul_u64_u32_shr() and switched x86
to use it.  This is, in principle, better; by optimizing the
64x64->64 multiplies to be 32x32->64 multiplies we can potentially
save some time.  However, the compiler piplines the 64x64->64
multiplies pretty well, and the conditional branch in the generic
mul_u64_u32_shr() causes some bubbles in execution, with the
result that it's pretty much a wash.  If tilegx provided its own
implementation of mul_u64_u32_shr() without the conditional branch,
we could potentially save 3 cycles, but that seems like small gain
for a fair amount of additional build scaffolding; no other platform
currently provides a mul_u64_u32_shr() override, and tile doesn't
currently have an <asm/div64.h> header to put the override in.

Additionally, gcc currently has an optimization bug that prevents
it from recognizing the opportunity to use a 32x32->64 multiply,
and so the result would be no better than the existing mult_frac()
until such time as the compiler is fixed.

For now, just using mult_frac() seems like the right answer.

Cc: stable@kernel.org [v3.4+]
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
2016-11-23 15:28:54 -05:00