docs: kvm: Convert timekeeping.txt to ReST format
- Use document title and chapter markups;
- Add markups for literal blocks;
- Add markups for tables;
- use :field: for field descriptions;
- Add blank lines and adjust indentation.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
committed by
Paolo Bonzini
parent
a9700af64e
commit
6012d9a9fa
@@ -18,6 +18,7 @@ KVM
|
|||||||
nested-vmx
|
nested-vmx
|
||||||
ppc-pv
|
ppc-pv
|
||||||
s390-diag
|
s390-diag
|
||||||
|
timekeeping
|
||||||
vcpu-requests
|
vcpu-requests
|
||||||
|
|
||||||
arm/index
|
arm/index
|
||||||
|
|||||||
@@ -1,17 +1,21 @@
|
|||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
Timekeeping Virtualization for X86-Based Architectures
|
======================================================
|
||||||
|
Timekeeping Virtualization for X86-Based Architectures
|
||||||
|
======================================================
|
||||||
|
|
||||||
Zachary Amsden <zamsden@redhat.com>
|
:Author: Zachary Amsden <zamsden@redhat.com>
|
||||||
Copyright (c) 2010, Red Hat. All rights reserved.
|
:Copyright: (c) 2010, Red Hat. All rights reserved.
|
||||||
|
|
||||||
1) Overview
|
.. Contents
|
||||||
2) Timing Devices
|
|
||||||
3) TSC Hardware
|
|
||||||
4) Virtualization Problems
|
|
||||||
|
|
||||||
=========================================================================
|
1) Overview
|
||||||
|
2) Timing Devices
|
||||||
|
3) TSC Hardware
|
||||||
|
4) Virtualization Problems
|
||||||
|
|
||||||
1) Overview
|
1. Overview
|
||||||
|
===========
|
||||||
|
|
||||||
One of the most complicated parts of the X86 platform, and specifically,
|
One of the most complicated parts of the X86 platform, and specifically,
|
||||||
the virtualization of this platform is the plethora of timing devices available
|
the virtualization of this platform is the plethora of timing devices available
|
||||||
@@ -27,15 +31,15 @@ The purpose of this document is to collect data and information relevant to
|
|||||||
timekeeping which may be difficult to find elsewhere, specifically,
|
timekeeping which may be difficult to find elsewhere, specifically,
|
||||||
information relevant to KVM and hardware-based virtualization.
|
information relevant to KVM and hardware-based virtualization.
|
||||||
|
|
||||||
=========================================================================
|
2. Timing Devices
|
||||||
|
=================
|
||||||
2) Timing Devices
|
|
||||||
|
|
||||||
First we discuss the basic hardware devices available. TSC and the related
|
First we discuss the basic hardware devices available. TSC and the related
|
||||||
KVM clock are special enough to warrant a full exposition and are described in
|
KVM clock are special enough to warrant a full exposition and are described in
|
||||||
the following section.
|
the following section.
|
||||||
|
|
||||||
2.1) i8254 - PIT
|
2.1. i8254 - PIT
|
||||||
|
----------------
|
||||||
|
|
||||||
One of the first timer devices available is the programmable interval timer,
|
One of the first timer devices available is the programmable interval timer,
|
||||||
or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
|
or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
|
||||||
@@ -50,12 +54,12 @@ The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done
|
|||||||
using single or multiple byte access to the I/O ports. There are 6 modes
|
using single or multiple byte access to the I/O ports. There are 6 modes
|
||||||
available, but not all modes are available to all timers, as only timer 2
|
available, but not all modes are available to all timers, as only timer 2
|
||||||
has a connected gate input, required for modes 1 and 5. The gate line is
|
has a connected gate input, required for modes 1 and 5. The gate line is
|
||||||
controlled by port 61h, bit 0, as illustrated in the following diagram.
|
controlled by port 61h, bit 0, as illustrated in the following diagram::
|
||||||
|
|
||||||
-------------- ----------------
|
-------------- ----------------
|
||||||
| | | |
|
| | | |
|
||||||
| 1.1932 MHz |---------->| CLOCK OUT | ---------> IRQ 0
|
| 1.1932 MHz|---------->| CLOCK OUT | ---------> IRQ 0
|
||||||
| Clock | | | |
|
| Clock | | | |
|
||||||
-------------- | +->| GATE TIMER 0 |
|
-------------- | +->| GATE TIMER 0 |
|
||||||
| ----------------
|
| ----------------
|
||||||
|
|
|
|
||||||
@@ -70,29 +74,33 @@ controlled by port 61h, bit 0, as illustrated in the following diagram.
|
|||||||
| | |
|
| | |
|
||||||
|------>| CLOCK OUT | ---------> Port 61h, bit 5
|
|------>| CLOCK OUT | ---------> Port 61h, bit 5
|
||||||
| | |
|
| | |
|
||||||
Port 61h, bit 0 ---------->| GATE TIMER 2 | \_.---- ____
|
Port 61h, bit 0 -------->| GATE TIMER 2 | \_.---- ____
|
||||||
---------------- _| )--|LPF|---Speaker
|
---------------- _| )--|LPF|---Speaker
|
||||||
/ *---- \___/
|
/ *---- \___/
|
||||||
Port 61h, bit 1 -----------------------------------/
|
Port 61h, bit 1 ---------------------------------/
|
||||||
|
|
||||||
The timer modes are now described.
|
The timer modes are now described.
|
||||||
|
|
||||||
Mode 0: Single Timeout. This is a one-shot software timeout that counts down
|
Mode 0: Single Timeout.
|
||||||
|
This is a one-shot software timeout that counts down
|
||||||
when the gate is high (always true for timers 0 and 1). When the count
|
when the gate is high (always true for timers 0 and 1). When the count
|
||||||
reaches zero, the output goes high.
|
reaches zero, the output goes high.
|
||||||
|
|
||||||
Mode 1: Triggered One-shot. The output is initially set high. When the gate
|
Mode 1: Triggered One-shot.
|
||||||
|
The output is initially set high. When the gate
|
||||||
line is set high, a countdown is initiated (which does not stop if the gate is
|
line is set high, a countdown is initiated (which does not stop if the gate is
|
||||||
lowered), during which the output is set low. When the count reaches zero,
|
lowered), during which the output is set low. When the count reaches zero,
|
||||||
the output goes high.
|
the output goes high.
|
||||||
|
|
||||||
Mode 2: Rate Generator. The output is initially set high. When the countdown
|
Mode 2: Rate Generator.
|
||||||
|
The output is initially set high. When the countdown
|
||||||
reaches 1, the output goes low for one count and then returns high. The value
|
reaches 1, the output goes low for one count and then returns high. The value
|
||||||
is reloaded and the countdown automatically resumes. If the gate line goes
|
is reloaded and the countdown automatically resumes. If the gate line goes
|
||||||
low, the count is halted. If the output is low when the gate is lowered, the
|
low, the count is halted. If the output is low when the gate is lowered, the
|
||||||
output automatically goes high (this only affects timer 2).
|
output automatically goes high (this only affects timer 2).
|
||||||
|
|
||||||
Mode 3: Square Wave. This generates a high / low square wave. The count
|
Mode 3: Square Wave.
|
||||||
|
This generates a high / low square wave. The count
|
||||||
determines the length of the pulse, which alternates between high and low
|
determines the length of the pulse, which alternates between high and low
|
||||||
when zero is reached. The count only proceeds when gate is high and is
|
when zero is reached. The count only proceeds when gate is high and is
|
||||||
automatically reloaded on reaching zero. The count is decremented twice at
|
automatically reloaded on reaching zero. The count is decremented twice at
|
||||||
@@ -103,12 +111,14 @@ Mode 3: Square Wave. This generates a high / low square wave. The count
|
|||||||
values are not observed when reading. This is the intended mode for timer 2,
|
values are not observed when reading. This is the intended mode for timer 2,
|
||||||
which generates sine-like tones by low-pass filtering the square wave output.
|
which generates sine-like tones by low-pass filtering the square wave output.
|
||||||
|
|
||||||
Mode 4: Software Strobe. After programming this mode and loading the counter,
|
Mode 4: Software Strobe.
|
||||||
|
After programming this mode and loading the counter,
|
||||||
the output remains high until the counter reaches zero. Then the output
|
the output remains high until the counter reaches zero. Then the output
|
||||||
goes low for 1 clock cycle and returns high. The counter is not reloaded.
|
goes low for 1 clock cycle and returns high. The counter is not reloaded.
|
||||||
Counting only occurs when gate is high.
|
Counting only occurs when gate is high.
|
||||||
|
|
||||||
Mode 5: Hardware Strobe. After programming and loading the counter, the
|
Mode 5: Hardware Strobe.
|
||||||
|
After programming and loading the counter, the
|
||||||
output remains high. When the gate is raised, a countdown is initiated
|
output remains high. When the gate is raised, a countdown is initiated
|
||||||
(which does not stop if the gate is lowered). When the counter reaches zero,
|
(which does not stop if the gate is lowered). When the counter reaches zero,
|
||||||
the output goes low for 1 clock cycle and then returns high. The counter is
|
the output goes low for 1 clock cycle and then returns high. The counter is
|
||||||
@@ -118,49 +128,49 @@ In addition to normal binary counting, the PIT supports BCD counting. The
|
|||||||
command port, 0x43 is used to set the counter and mode for each of the three
|
command port, 0x43 is used to set the counter and mode for each of the three
|
||||||
timers.
|
timers.
|
||||||
|
|
||||||
PIT commands, issued to port 0x43, using the following bit encoding:
|
PIT commands, issued to port 0x43, using the following bit encoding::
|
||||||
|
|
||||||
Bit 7-4: Command (See table below)
|
Bit 7-4: Command (See table below)
|
||||||
Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
|
Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
|
||||||
Bit 0 : Binary (0) / BCD (1)
|
Bit 0 : Binary (0) / BCD (1)
|
||||||
|
|
||||||
Command table:
|
Command table::
|
||||||
|
|
||||||
0000 - Latch Timer 0 count for port 0x40
|
0000 - Latch Timer 0 count for port 0x40
|
||||||
sample and hold the count to be read in port 0x40;
|
sample and hold the count to be read in port 0x40;
|
||||||
additional commands ignored until counter is read;
|
additional commands ignored until counter is read;
|
||||||
mode bits ignored.
|
mode bits ignored.
|
||||||
|
|
||||||
0001 - Set Timer 0 LSB mode for port 0x40
|
0001 - Set Timer 0 LSB mode for port 0x40
|
||||||
set timer to read LSB only and force MSB to zero;
|
set timer to read LSB only and force MSB to zero;
|
||||||
mode bits set timer mode
|
mode bits set timer mode
|
||||||
|
|
||||||
0010 - Set Timer 0 MSB mode for port 0x40
|
0010 - Set Timer 0 MSB mode for port 0x40
|
||||||
set timer to read MSB only and force LSB to zero;
|
set timer to read MSB only and force LSB to zero;
|
||||||
mode bits set timer mode
|
mode bits set timer mode
|
||||||
|
|
||||||
0011 - Set Timer 0 16-bit mode for port 0x40
|
0011 - Set Timer 0 16-bit mode for port 0x40
|
||||||
set timer to read / write LSB first, then MSB;
|
set timer to read / write LSB first, then MSB;
|
||||||
mode bits set timer mode
|
mode bits set timer mode
|
||||||
|
|
||||||
0100 - Latch Timer 1 count for port 0x41 - as described above
|
0100 - Latch Timer 1 count for port 0x41 - as described above
|
||||||
0101 - Set Timer 1 LSB mode for port 0x41 - as described above
|
0101 - Set Timer 1 LSB mode for port 0x41 - as described above
|
||||||
0110 - Set Timer 1 MSB mode for port 0x41 - as described above
|
0110 - Set Timer 1 MSB mode for port 0x41 - as described above
|
||||||
0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
|
0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
|
||||||
|
|
||||||
1000 - Latch Timer 2 count for port 0x42 - as described above
|
1000 - Latch Timer 2 count for port 0x42 - as described above
|
||||||
1001 - Set Timer 2 LSB mode for port 0x42 - as described above
|
1001 - Set Timer 2 LSB mode for port 0x42 - as described above
|
||||||
1010 - Set Timer 2 MSB mode for port 0x42 - as described above
|
1010 - Set Timer 2 MSB mode for port 0x42 - as described above
|
||||||
1011 - Set Timer 2 16-bit mode for port 0x42 as described above
|
1011 - Set Timer 2 16-bit mode for port 0x42 as described above
|
||||||
|
|
||||||
1101 - General counter latch
|
1101 - General counter latch
|
||||||
Latch combination of counters into corresponding ports
|
Latch combination of counters into corresponding ports
|
||||||
Bit 3 = Counter 2
|
Bit 3 = Counter 2
|
||||||
Bit 2 = Counter 1
|
Bit 2 = Counter 1
|
||||||
Bit 1 = Counter 0
|
Bit 1 = Counter 0
|
||||||
Bit 0 = Unused
|
Bit 0 = Unused
|
||||||
|
|
||||||
1110 - Latch timer status
|
1110 - Latch timer status
|
||||||
Latch combination of counter mode into corresponding ports
|
Latch combination of counter mode into corresponding ports
|
||||||
Bit 3 = Counter 2
|
Bit 3 = Counter 2
|
||||||
Bit 2 = Counter 1
|
Bit 2 = Counter 1
|
||||||
@@ -177,7 +187,8 @@ Command table:
|
|||||||
Bit 3-1 = Mode
|
Bit 3-1 = Mode
|
||||||
Bit 0 = Binary (0) / BCD mode (1)
|
Bit 0 = Binary (0) / BCD mode (1)
|
||||||
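The bit encoding above can be sketched as a small helper that packs a command byte for port 0x43. This is only an illustration of the layout described in the text: the function name `pit_command` is hypothetical, and it covers only the "set timer" commands (the latch commands 1101 and 1110 reuse bits 3-1 as counter selectors and are not handled here).

```python
def pit_command(command: int, mode: int, bcd: bool = False) -> int:
    """Pack a PIT command byte for port 0x43 (illustrative helper).

    command: 4-bit value from the command table (bits 7-4)
    mode:    timer mode 0-5 (bits 3-1)
    bcd:     False for binary counting, True for BCD (bit 0)
    """
    if not 0 <= command <= 0b1111:
        raise ValueError("command must fit in 4 bits")
    if not 0 <= mode <= 5:
        raise ValueError("mode must be in the range 0-5")
    return (command << 4) | (mode << 1) | int(bcd)


# Timer 0, 16-bit access (0011), mode 2 (rate generator), binary counting.
timer0_rate_gen = pit_command(0b0011, 2)

# Timer 2, 16-bit access (1011), mode 3 (square wave), binary counting.
timer2_square = pit_command(0b1011, 3)

print(hex(timer0_rate_gen), hex(timer2_square))
```

The two values computed here, 0x34 and 0xB6, follow directly from the table: the command nibble selects the timer and access width, and the low nibble carries the mode and the binary/BCD flag.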
|
|
||||||
2.2) RTC
|
2.2. RTC
|
||||||
|
--------
|
||||||
|
|
||||||
The second device which was available in the original PC was the MC146818 real
|
The second device which was available in the original PC was the MC146818 real
|
||||||
time clock. The original device is now obsolete, and usually emulated by the
|
time clock. The original device is now obsolete, and usually emulated by the
|
||||||
@@ -201,21 +212,21 @@ in progress, as indicated in the status register.
|
|||||||
The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
|
The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
|
||||||
programmed to a 32kHz divider if the RTC is to count seconds.
|
programmed to a 32kHz divider if the RTC is to count seconds.
|
||||||
|
|
||||||
This is the RAM map originally used for the RTC/CMOS:
|
This is the RAM map originally used for the RTC/CMOS::
|
||||||
|
|
||||||
Location Size Description
|
Location Size Description
|
||||||
------------------------------------------
|
------------------------------------------
|
||||||
00h byte Current second (BCD)
|
00h byte Current second (BCD)
|
||||||
01h byte Seconds alarm (BCD)
|
01h byte Seconds alarm (BCD)
|
||||||
02h byte Current minute (BCD)
|
02h byte Current minute (BCD)
|
||||||
03h byte Minutes alarm (BCD)
|
03h byte Minutes alarm (BCD)
|
||||||
04h byte Current hour (BCD)
|
04h byte Current hour (BCD)
|
||||||
05h byte Hours alarm (BCD)
|
05h byte Hours alarm (BCD)
|
||||||
06h byte Current day of week (BCD)
|
06h byte Current day of week (BCD)
|
||||||
07h byte Current day of month (BCD)
|
07h byte Current day of month (BCD)
|
||||||
08h byte Current month (BCD)
|
08h byte Current month (BCD)
|
||||||
09h byte Current year (BCD)
|
09h byte Current year (BCD)
|
||||||
0Ah byte Register A
|
0Ah byte Register A
|
||||||
bit 7 = Update in progress
|
bit 7 = Update in progress
|
||||||
bit 6-4 = Divider for clock
|
bit 6-4 = Divider for clock
|
||||||
000 = 4.194 MHz
|
000 = 4.194 MHz
|
||||||
@@ -234,7 +245,7 @@ Location Size Description
|
|||||||
1101 = 125 mS
|
1101 = 125 mS
|
||||||
1110 = 250 mS
|
1110 = 250 mS
|
||||||
1111 = 500 mS
|
1111 = 500 mS
|
||||||
0Bh byte Register B
|
0Bh byte Register B
|
||||||
bit 7 = Run (0) / Halt (1)
|
bit 7 = Run (0) / Halt (1)
|
||||||
bit 6 = Periodic interrupt enable
|
bit 6 = Periodic interrupt enable
|
||||||
bit 5 = Alarm interrupt enable
|
bit 5 = Alarm interrupt enable
|
||||||
@@ -243,19 +254,20 @@ Location Size Description
|
|||||||
bit 2 = BCD calendar (0) / Binary (1)
|
bit 2 = BCD calendar (0) / Binary (1)
|
||||||
bit 1 = 12-hour mode (0) / 24-hour mode (1)
|
bit 1 = 12-hour mode (0) / 24-hour mode (1)
|
||||||
bit 0 = 0 (DST off) / 1 (DST enabled)
|
bit 0 = 0 (DST off) / 1 (DST enabled)
|
||||||
0Ch byte Register C (read only)
|
0Ch byte Register C (read only)
|
||||||
bit 7 = interrupt request flag (IRQF)
|
bit 7 = interrupt request flag (IRQF)
|
||||||
bit 6 = periodic interrupt flag (PF)
|
bit 6 = periodic interrupt flag (PF)
|
||||||
bit 5 = alarm interrupt flag (AF)
|
bit 5 = alarm interrupt flag (AF)
|
||||||
bit 4 = update interrupt flag (UF)
|
bit 4 = update interrupt flag (UF)
|
||||||
bit 3-0 = reserved
|
bit 3-0 = reserved
|
||||||
0Dh byte Register D (read only)
|
0Dh byte Register D (read only)
|
||||||
bit 7 = RTC has power
|
bit 7 = RTC has power
|
||||||
bit 6-0 = reserved
|
bit 6-0 = reserved
|
||||||
32h byte Current century BCD (*)
|
32h byte Current century BCD (*)
|
||||||
(*) location vendor specific and now determined from ACPI global tables
|
(*) location vendor specific and now determined from ACPI global tables
|
||||||
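Most calendar fields in the RAM map are stored as BCD. A minimal sketch of decoding them follows, assuming register B bit 2 is 0 (BCD calendar) and that the century byte really lives at 32h on the machine in question (the map notes this location is vendor specific). The helper names and the `cmos` dictionary are hypothetical stand-ins for whatever mechanism reads the CMOS bytes.

```python
def bcd_to_int(value: int) -> int:
    """Decode one packed-BCD byte (two decimal digits) to an integer."""
    high, low = value >> 4, value & 0x0F
    if high > 9 or low > 9:
        raise ValueError(f"not a valid BCD byte: {value:#04x}")
    return high * 10 + low


def rtc_year(cmos: dict) -> int:
    """Combine the century (32h) and year (09h) bytes into a full year.

    `cmos` maps a RAM location to the raw byte read from it.
    """
    return bcd_to_int(cmos[0x32]) * 100 + bcd_to_int(cmos[0x09])


# Example snapshot: century byte 0x20, year byte 0x10, seconds byte 0x59.
snapshot = {0x32: 0x20, 0x09: 0x10, 0x00: 0x59}
print(rtc_year(snapshot), bcd_to_int(snapshot[0x00]))
```

If register B selects binary mode instead, the bytes can be used directly and no BCD decoding is needed.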
|
|
||||||
2.3) APIC
|
2.3. APIC
|
||||||
|
---------
|
||||||
|
|
||||||
On Pentium and later processors, an on-board timer is available to each CPU
|
On Pentium and later processors, an on-board timer is available to each CPU
|
||||||
as part of the Advanced Programmable Interrupt Controller. The APIC is
|
as part of the Advanced Programmable Interrupt Controller. The APIC is
|
||||||
@@ -276,7 +288,8 @@ timer is programmed through the LVT (local vector timer) register, is capable
|
|||||||
of one-shot or periodic operation, and is based on the bus clock divided down
|
of one-shot or periodic operation, and is based on the bus clock divided down
|
||||||
by the programmable divider register.
|
by the programmable divider register.
|
||||||
|
|
||||||
2.4) HPET
|
2.4. HPET
|
||||||
|
---------
|
||||||
|
|
||||||
HPET is quite complex, and was originally intended to replace the PIT / RTC
|
HPET is quite complex, and was originally intended to replace the PIT / RTC
|
||||||
support of the X86 PC. It remains to be seen whether that will be the case, as
|
support of the X86 PC. It remains to be seen whether that will be the case, as
|
||||||
@@ -297,7 +310,8 @@ indicated through ACPI tables by the BIOS.
|
|||||||
Detailed specification of the HPET is beyond the current scope of this
|
Detailed specification of the HPET is beyond the current scope of this
|
||||||
document, as it is also very well documented elsewhere.
|
document, as it is also very well documented elsewhere.
|
||||||
|
|
||||||
2.5) Offboard Timers
|
2.5. Offboard Timers
|
||||||
|
--------------------
|
||||||
|
|
||||||
Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
|
Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
|
||||||
timing chips built into the cards which may have registers which are accessible
|
timing chips built into the cards which may have registers which are accessible
|
||||||
@@ -307,9 +321,8 @@ general frowned upon as not playing by the agreed rules of the game. Such a
|
|||||||
timer device would require additional support to be virtualized properly and is
|
timer device would require additional support to be virtualized properly and is
|
||||||
not considered important at this time as no known operating system does this.
|
not considered important at this time as no known operating system does this.
|
||||||
|
|
||||||
=========================================================================
|
3. TSC Hardware
|
||||||
|
===============
|
||||||
3) TSC Hardware
|
|
||||||
|
|
||||||
The TSC or time stamp counter is relatively simple in theory; it counts
|
The TSC or time stamp counter is relatively simple in theory; it counts
|
||||||
instruction cycles issued by the processor, which can be used as a measure of
|
instruction cycles issued by the processor, which can be used as a measure of
|
||||||
@@ -340,7 +353,8 @@ allows the guest visible TSC to be offset by a constant. Newer implementations
|
|||||||
promise to allow the TSC to additionally be scaled, but this hardware is not
|
promise to allow the TSC to additionally be scaled, but this hardware is not
|
||||||
yet widely available.
|
yet widely available.
|
||||||
|
|
||||||
3.1) TSC synchronization
|
3.1. TSC synchronization
|
||||||
|
------------------------
|
||||||
|
|
||||||
The TSC is a CPU-local clock in most implementations. This means, on SMP
|
The TSC is a CPU-local clock in most implementations. This means, on SMP
|
||||||
platforms, the TSCs of different CPUs may start at different times depending
|
platforms, the TSCs of different CPUs may start at different times depending
|
||||||
@@ -357,7 +371,8 @@ practice, getting a perfectly synchronized TSC will not be possible unless all
|
|||||||
values are read from the same clock, which generally only is possible on single
|
values are read from the same clock, which generally only is possible on single
|
||||||
socket systems or those with special hardware support.
|
socket systems or those with special hardware support.
|
||||||
|
|
||||||
3.2) TSC and CPU hotplug
|
3.2. TSC and CPU hotplug
|
||||||
|
------------------------
|
||||||
|
|
||||||
As touched on already, CPUs which arrive later than the boot time of the system
|
As touched on already, CPUs which arrive later than the boot time of the system
|
||||||
may not have a TSC value that is synchronized with the rest of the system.
|
may not have a TSC value that is synchronized with the rest of the system.
|
||||||
@@ -367,7 +382,8 @@ a guarantee. This can have the effect of bringing a system from a state where
|
|||||||
TSC is synchronized back to a state where TSC synchronization flaws, however
|
TSC is synchronized back to a state where TSC synchronization flaws, however
|
||||||
small, may be exposed to the OS and any virtualization environment.
|
small, may be exposed to the OS and any virtualization environment.
|
||||||
|
|
||||||
3.3) TSC and multi-socket / NUMA
|
3.3. TSC and multi-socket / NUMA
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
Multi-socket systems, especially large multi-socket systems are likely to have
|
Multi-socket systems, especially large multi-socket systems are likely to have
|
||||||
individual clocksources rather than a single, universally distributed clock.
|
individual clocksources rather than a single, universally distributed clock.
|
||||||
@@ -385,7 +401,8 @@ standards for telecommunications and computer equipment.
|
|||||||
It is recommended not to trust the TSCs to remain synchronized on NUMA or
|
It is recommended not to trust the TSCs to remain synchronized on NUMA or
|
||||||
multiple socket systems for these reasons.
|
multiple socket systems for these reasons.
|
||||||
|
|
||||||
3.4) TSC and C-states
|
3.4. TSC and C-states
|
||||||
|
---------------------
|
||||||
|
|
||||||
C-states, or idling states of the processor, especially C1E and deeper sleep
|
C-states, or idling states of the processor, especially C1E and deeper sleep
|
||||||
states may be problematic for TSC as well. The TSC may stop advancing in such
|
states may be problematic for TSC as well. The TSC may stop advancing in such
|
||||||
@@ -396,7 +413,8 @@ based on CPU and chipset identifications.
|
|||||||
The TSC in such a case may be corrected by catching it up to a known external
|
The TSC in such a case may be corrected by catching it up to a known external
|
||||||
clocksource.
|
clocksource.
|
||||||
|
|
||||||
3.5) TSC frequency change / P-states
|
3.5. TSC frequency change / P-states
|
||||||
|
------------------------------------
|
||||||
|
|
||||||
To make things slightly more interesting, some CPUs may change frequency. They
|
To make things slightly more interesting, some CPUs may change frequency. They
|
||||||
may or may not run the TSC at the same rate, and because the frequency change
|
may or may not run the TSC at the same rate, and because the frequency change
|
||||||
@@ -416,14 +434,16 @@ other processors. In such cases, the TSC on halted CPUs could advance faster
|
|||||||
than that of non-halted processors. AMD Turion processors are known to have
|
than that of non-halted processors. AMD Turion processors are known to have
|
||||||
this problem.
|
this problem.
|
||||||
|
|
||||||
3.6) TSC and STPCLK / T-states
|
3.6. TSC and STPCLK / T-states
|
||||||
|
------------------------------
|
||||||
|
|
||||||
External signals given to the processor may also have the effect of stopping
|
External signals given to the processor may also have the effect of stopping
|
||||||
the TSC. This is typically done for thermal emergency power control to prevent
|
the TSC. This is typically done for thermal emergency power control to prevent
|
||||||
an overheating condition, and typically, there is no way to detect that this
|
an overheating condition, and typically, there is no way to detect that this
|
||||||
condition has happened.
|
condition has happened.
|
||||||
|
|
||||||
3.7) TSC virtualization - VMX
|
3.7. TSC virtualization - VMX
|
||||||
|
-----------------------------
|
||||||
|
|
||||||
VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
|
VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
|
||||||
instructions, which is enough for full virtualization of TSC in any manner. In
|
instructions, which is enough for full virtualization of TSC in any manner. In
|
||||||
@@ -431,14 +451,16 @@ addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
|
|||||||
field specified in the VMCS. Special instructions must be used to read and
|
field specified in the VMCS. Special instructions must be used to read and
|
||||||
write the VMCS field.
|
write the VMCS field.
|
||||||
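The offset mechanism can be illustrated with plain integer arithmetic. The sketch below assumes the offset is added to the host TSC with 64-bit wraparound; the function names are hypothetical and nothing here touches a real VMCS.

```python
MASK64 = (1 << 64) - 1  # the TSC is treated here as a 64-bit counter


def guest_tsc(host_tsc: int, tsc_offset: int) -> int:
    """Value the guest observes: host TSC plus offset, modulo 2**64."""
    return (host_tsc + tsc_offset) & MASK64


def offset_for(host_tsc: int, desired_guest_tsc: int) -> int:
    """Offset that makes the guest see `desired_guest_tsc` right now."""
    return (desired_guest_tsc - host_tsc) & MASK64


# Start a guest whose TSC should read 0 while the host TSC is at 1,000,000.
offset = offset_for(1_000_000, 0)
print(guest_tsc(1_000_000, offset), guest_tsc(1_000_500, offset))
```

Because the offset is constant, the guest TSC advances at exactly the host rate; a negative adjustment is simply represented as a large unsigned value that wraps.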
|
|
||||||
3.8) TSC virtualization - SVM
|
3.8. TSC virtualization - SVM
|
||||||
|
-----------------------------
|
||||||
|
|
||||||
SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
|
SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
|
||||||
instructions, which is enough for full virtualization of TSC in any manner. In
|
instructions, which is enough for full virtualization of TSC in any manner. In
|
||||||
addition, SVM allows passing through the host TSC plus an additional offset
|
addition, SVM allows passing through the host TSC plus an additional offset
|
||||||
field specified in the SVM control block.
|
field specified in the SVM control block.
|
||||||
|
|
||||||
3.9) TSC feature bits in Linux
|
3.9. TSC feature bits in Linux
|
||||||
|
------------------------------
|
||||||
|
|
||||||
In summary, there is no way to guarantee the TSC remains in perfect
|
In summary, there is no way to guarantee the TSC remains in perfect
|
||||||
synchronization unless it is explicitly guaranteed by the architecture. Even
|
synchronization unless it is explicitly guaranteed by the architecture. Even
|
||||||
@@ -448,13 +470,16 @@ despite being locally consistent.
|
|||||||
The following feature bits are used by Linux to signal various TSC attributes,
|
The following feature bits are used by Linux to signal various TSC attributes,
|
||||||
but they can only be taken to be meaningful for UP or single node systems.
|
but they can only be taken to be meaningful for UP or single node systems.
|
||||||
|
|
||||||
X86_FEATURE_TSC : The TSC is available in hardware
|
========================= =======================================
|
||||||
X86_FEATURE_RDTSCP : The RDTSCP instruction is available
|
X86_FEATURE_TSC The TSC is available in hardware
|
||||||
X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states
|
X86_FEATURE_RDTSCP The RDTSCP instruction is available
|
||||||
X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states
|
X86_FEATURE_CONSTANT_TSC The TSC rate is unchanged with P-states
|
||||||
X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware)
|
X86_FEATURE_NONSTOP_TSC The TSC does not stop in C-states
|
||||||
|
X86_FEATURE_TSC_RELIABLE TSC sync checks are skipped (VMware)
|
||||||
|
========================= =======================================
|
||||||
|
|
||||||
4) Virtualization Problems
|
4. Virtualization Problems
|
||||||
|
==========================
|
||||||
|
|
||||||
Timekeeping is especially problematic for virtualization because a number of
|
Timekeeping is especially problematic for virtualization because a number of
|
||||||
challenges arise. The most obvious problem is that time is now shared between
|
challenges arise. The most obvious problem is that time is now shared between
|
||||||
@@ -473,7 +498,8 @@ BIOS, but not in such an extreme fashion. However, the fact that SMM mode may
|
|||||||
cause similar problems to virtualization makes it a good justification for
|
cause similar problems to virtualization makes it a good justification for
|
||||||
solving many of these problems on bare metal.
|
solving many of these problems on bare metal.
|
||||||
|
|
||||||
4.1) Interrupt clocking
|
4.1. Interrupt clocking
|
||||||
|
-----------------------
|
||||||
|
|
||||||
One of the most immediate problems that occurs with legacy operating systems
|
One of the most immediate problems that occurs with legacy operating systems
|
||||||
is that the system timekeeping routines are often designed to keep track of
|
is that the system timekeeping routines are often designed to keep track of
|
||||||
@@ -502,7 +528,8 @@ thus requires interrupt slewing to keep proper time. It does use a low enough
|
|||||||
rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
|
rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
|
||||||
practice.
|
practice.
|
||||||
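As a quick check on the rate queried above: if the tick comes from the PIT left at its maximum 16-bit count, the arithmetic gives roughly 18.2 Hz. The sketch below assumes the 1.193182 MHz base clock stated earlier and the usual 8254 convention that a programmed count of 0 means 65536; the function names are hypothetical.

```python
PIT_BASE_HZ = 1_193_182  # fixed PIT input clock, from section 2.1


def pit_interrupt_rate(reload_count: int) -> float:
    """Interrupt rate in Hz for a given 16-bit reload count."""
    # A programmed count of 0 is interpreted as 65536, the longest period.
    divisor = 65536 if reload_count == 0 else reload_count
    return PIT_BASE_HZ / divisor


def pit_reload_for(rate_hz: float) -> int:
    """Nearest reload count for a desired interrupt rate."""
    return round(PIT_BASE_HZ / rate_hz)


print(f"{pit_interrupt_rate(0):.4f} Hz")  # slowest possible tick
print(pit_reload_for(100))                # count for a 100 Hz tick
```

A slower tick means fewer interrupts to replay after the guest has been descheduled, which is consistent with the remark that the low rate has not been a problem in practice.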
|
|
||||||
4.2) TSC sampling and serialization
|
4.2. TSC sampling and serialization
|
||||||
|
-----------------------------------
|
||||||
|
|
||||||
As the highest precision time source available, the cycle counter of the CPU
|
As the highest precision time source available, the cycle counter of the CPU
|
||||||
has aroused much interest from developers. As explained above, this timer has
|
has aroused much interest from developers. As explained above, this timer has
|
||||||
@@ -524,7 +551,8 @@ it may be necessary for an implementation to guard against "backwards" reads of
|
|||||||
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
|
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
|
||||||
system.
|
system.
|
||||||
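One generic way to guard against such "backwards" reads is to clamp every sample to the largest value handed out so far. The sketch below shows only that idea; it is not taken from KVM or the Linux clocksource code, and the class name `MonotonicTsc` is hypothetical.

```python
import threading


class MonotonicTsc:
    """Return TSC samples that never go backwards across callers."""

    def __init__(self) -> None:
        self._last = 0
        self._lock = threading.Lock()

    def observe(self, raw_tsc: int) -> int:
        """Record a raw sample and return a non-decreasing value."""
        with self._lock:
            if raw_tsc > self._last:
                self._last = raw_tsc
            return self._last


clock = MonotonicTsc()
# The third sample (103) arrives "late" and is clamped to 105.
print([clock.observe(sample) for sample in (100, 105, 103, 110)])
```

The cost of this approach is a shared, contended variable on every read, which is one reason real implementations prefer to avoid it when the hardware is known to be well behaved.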
|
|
||||||
4.3) Timespec aliasing
|
4.3. Timespec aliasing
|
||||||
|
----------------------
|
||||||
|
|
||||||
Additionally, this lack of serialization from the TSC poses another challenge
|
Additionally, this lack of serialization from the TSC poses another challenge
|
||||||
when using results of the TSC when measured against another time source. As
|
when using results of the TSC when measured against another time source. As
|
||||||
@@ -548,7 +576,8 @@ This aliasing requires care in the computation and recalibration of kvmclock
|
|||||||
and any other values derived from TSC computation (such as TSC virtualization
|
and any other values derived from TSC computation (such as TSC virtualization
|
||||||
itself).
|
itself).
|
||||||
|
|
||||||
4.4) Migration
|
4.4. Migration
|
||||||
|
--------------
|
||||||
|
|
||||||
Migration of a virtual machine raises problems for timekeeping in two ways.
|
Migration of a virtual machine raises problems for timekeeping in two ways.
|
||||||
First, the migration itself may take time, during which interrupts cannot be
|
First, the migration itself may take time, during which interrupts cannot be
|
||||||
@@ -566,7 +595,8 @@ always be caught up to the original rate. KVM clock avoids these problems by
|
|||||||
simply storing multipliers and offsets against the TSC for the guest to convert
|
simply storing multipliers and offsets against the TSC for the guest to convert
|
||||||
back into nanosecond resolution values.
|
back into nanosecond resolution values.
|
||||||
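The multiplier-and-offset conversion can be sketched as fixed-point arithmetic. This is a simplified illustration, not the actual kvmclock ABI: the parameter names are hypothetical, the multiplier is assumed to be a 32.32 fixed-point count of nanoseconds per TSC tick, and Python's unbounded integers hide the overflow concerns a real implementation has to handle.

```python
def tsc_to_ns(tsc: int, tsc_base: int, ns_base: int, mult: int, shift: int) -> int:
    """Convert a TSC reading to nanoseconds using a stored scale and offset.

    tsc_base / ns_base: TSC value and nanosecond time captured together
    mult:               32.32 fixed-point nanoseconds per (shifted) tick
    shift:              pre-scaling applied to the tick delta
    """
    delta = tsc - tsc_base
    delta = delta << shift if shift >= 0 else delta >> -shift
    return ns_base + ((delta * mult) >> 32)


def mult_for(tsc_hz: int) -> int:
    """Multiplier for a given TSC frequency, assuming shift == 0."""
    return (1_000_000_000 << 32) // tsc_hz


# A 2 GHz TSC advances 2,000,000,000 ticks in one second.
mult = mult_for(2_000_000_000)
print(tsc_to_ns(5_000_000_000, 3_000_000_000, 7, mult, 0))
```

After a migration to a host with a different TSC frequency, only the stored base values and multiplier need to be refreshed; the guest keeps using the same conversion.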
|
|
||||||
4.5) Scheduling
|
4.5. Scheduling
|
||||||
|
---------------
|
||||||
|
|
||||||
Since scheduling may be based on precise timing and firing of interrupts, the
|
Since scheduling may be based on precise timing and firing of interrupts, the
|
||||||
scheduling algorithms of an operating system may be adversely affected by
|
scheduling algorithms of an operating system may be adversely affected by
|
||||||
@@ -579,7 +609,8 @@ In an attempt to work around this, several implementations have provided a
|
|||||||
paravirtualized scheduler clock, which reveals the true amount of CPU time for
|
paravirtualized scheduler clock, which reveals the true amount of CPU time for
|
||||||
which a virtual machine has been running.
|
which a virtual machine has been running.
|
||||||
|
|
||||||
4.6) Watchdogs
|
4.6. Watchdogs
|
||||||
|
--------------
|
||||||
|
|
||||||
Watchdog timers, such as the lock detector in Linux may fire accidentally when
|
Watchdog timers, such as the lock detector in Linux may fire accidentally when
|
||||||
running under hardware virtualization due to timer interrupts being delayed or
|
running under hardware virtualization due to timer interrupts being delayed or
|
||||||
@@ -587,7 +618,8 @@ misinterpretation of the passage of real time. Usually, these warnings are
|
|||||||
spurious and can be ignored, but in some circumstances it may be necessary to
|
spurious and can be ignored, but in some circumstances it may be necessary to
|
||||||
disable such detection.
|
disable such detection.
|
||||||
|
|
||||||
4.7) Delays and precision timing
|
4.7. Delays and precision timing
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
Precise timing and delays may not be possible in a virtualized system. This
|
Precise timing and delays may not be possible in a virtualized system. This
|
||||||
can happen if the system is controlling physical hardware, or issues delays to
|
can happen if the system is controlling physical hardware, or issues delays to
|
||||||
@@ -600,7 +632,8 @@ The second issue may cause performance problems, but this is unlikely to be a
|
|||||||
significant issue. In many cases these delays may be eliminated through
|
significant issue. In many cases these delays may be eliminated through
|
||||||
configuration or paravirtualization.
|
configuration or paravirtualization.
|
||||||
|
|
||||||
4.8) Covert channels and leaks
|
4.8. Covert channels and leaks
|
||||||
|
------------------------------
|
||||||
|
|
||||||
In addition to the above problems, time information will inevitably leak to the
|
In addition to the above problems, time information will inevitably leak to the
|
||||||
guest about the host in anything but a perfect implementation of virtualized
|
guest about the host in anything but a perfect implementation of virtualized
|
||||||