Merge branch 'x86/setup' into x86/devel
This commit is contained in:
119
Documentation/x86/i386/IO-APIC.txt
Normal file
119
Documentation/x86/i386/IO-APIC.txt
Normal file
@@ -0,0 +1,119 @@
|
||||
Most (all) Intel-MP compliant SMP boards have the so-called 'IO-APIC',
|
||||
which is an enhanced interrupt controller. It enables us to route
|
||||
hardware interrupts to multiple CPUs, or to CPU groups. Without an
|
||||
IO-APIC, interrupts from hardware will be delivered only to the
|
||||
CPU which boots the operating system (usually CPU#0).
|
||||
|
||||
Linux supports all variants of compliant SMP boards, including ones with
|
||||
multiple IO-APICs. Multiple IO-APICs are used in high-end servers to
|
||||
distribute IRQ load further.
|
||||
|
||||
There are (a few) known breakages in certain older boards, such bugs are
|
||||
usually worked around by the kernel. If your MP-compliant SMP board does
|
||||
not boot Linux, then consult the linux-smp mailing list archives first.
|
||||
|
||||
If your box boots fine with enabled IO-APIC IRQs, then your
|
||||
/proc/interrupts will look like this one:
|
||||
|
||||
---------------------------->
|
||||
hell:~> cat /proc/interrupts
|
||||
CPU0
|
||||
0: 1360293 IO-APIC-edge timer
|
||||
1: 4 IO-APIC-edge keyboard
|
||||
2: 0 XT-PIC cascade
|
||||
13: 1 XT-PIC fpu
|
||||
14: 1448 IO-APIC-edge ide0
|
||||
16: 28232 IO-APIC-level Intel EtherExpress Pro 10/100 Ethernet
|
||||
17: 51304 IO-APIC-level eth0
|
||||
NMI: 0
|
||||
ERR: 0
|
||||
hell:~>
|
||||
<----------------------------
|
||||
|
||||
Some interrupts are still listed as 'XT PIC', but this is not a problem;
|
||||
none of those IRQ sources is performance-critical.
|
||||
|
||||
|
||||
In the unlikely case that your board does not create a working mp-table,
|
||||
you can use the pirq= boot parameter to 'hand-construct' IRQ entries. This
|
||||
is non-trivial though and cannot be automated. One sample /etc/lilo.conf
|
||||
entry:
|
||||
|
||||
append="pirq=15,11,10"
|
||||
|
||||
The actual numbers depend on your system, on your PCI cards and on their
|
||||
PCI slot position. Usually PCI slots are 'daisy chained' before they are
|
||||
connected to the PCI chipset IRQ routing facility (the incoming PIRQ1-4
|
||||
lines):
|
||||
|
||||
,-. ,-. ,-. ,-. ,-.
|
||||
PIRQ4 ----| |-. ,-| |-. ,-| |-. ,-| |--------| |
|
||||
|S| \ / |S| \ / |S| \ / |S| |S|
|
||||
PIRQ3 ----|l|-. `/---|l|-. `/---|l|-. `/---|l|--------|l|
|
||||
|o| \/ |o| \/ |o| \/ |o| |o|
|
||||
PIRQ2 ----|t|-./`----|t|-./`----|t|-./`----|t|--------|t|
|
||||
|1| /\ |2| /\ |3| /\ |4| |5|
|
||||
PIRQ1 ----| |- `----| |- `----| |- `----| |--------| |
|
||||
`-' `-' `-' `-' `-'
|
||||
|
||||
Every PCI card emits a PCI IRQ, which can be INTA, INTB, INTC or INTD:
|
||||
|
||||
,-.
|
||||
INTD--| |
|
||||
|S|
|
||||
INTC--|l|
|
||||
|o|
|
||||
INTB--|t|
|
||||
|x|
|
||||
INTA--| |
|
||||
`-'
|
||||
|
||||
These INTA-D PCI IRQs are always 'local to the card', their real meaning
|
||||
depends on which slot they are in. If you look at the daisy chaining diagram,
|
||||
a card in slot4, issuing INTA IRQ, it will end up as a signal on PIRQ4 of
|
||||
the PCI chipset. Most cards issue INTA, this creates optimal distribution
|
||||
between the PIRQ lines. (distributing IRQ sources properly is not a
|
||||
necessity, PCI IRQs can be shared at will, but it's a good for performance
|
||||
to have non shared interrupts). Slot5 should be used for videocards, they
|
||||
do not use interrupts normally, thus they are not daisy chained either.
|
||||
|
||||
so if you have your SCSI card (IRQ11) in Slot1, Tulip card (IRQ9) in
|
||||
Slot2, then you'll have to specify this pirq= line:
|
||||
|
||||
append="pirq=11,9"
|
||||
|
||||
the following script tries to figure out such a default pirq= line from
|
||||
your PCI configuration:
|
||||
|
||||
echo -n pirq=; echo `scanpci | grep T_L | cut -c56-` | sed 's/ /,/g'
|
||||
|
||||
note that this script wont work if you have skipped a few slots or if your
|
||||
board does not do default daisy-chaining. (or the IO-APIC has the PIRQ pins
|
||||
connected in some strange way). E.g. if in the above case you have your SCSI
|
||||
card (IRQ11) in Slot3, and have Slot1 empty:
|
||||
|
||||
append="pirq=0,9,11"
|
||||
|
||||
[value '0' is a generic 'placeholder', reserved for empty (or non-IRQ emitting)
|
||||
slots.]
|
||||
|
||||
Generally, it's always possible to find out the correct pirq= settings, just
|
||||
permute all IRQ numbers properly ... it will take some time though. An
|
||||
'incorrect' pirq line will cause the booting process to hang, or a device
|
||||
won't function properly (e.g. if it's inserted as a module).
|
||||
|
||||
If you have 2 PCI buses, then you can use up to 8 pirq values, although such
|
||||
boards tend to have a good configuration.
|
||||
|
||||
Be prepared that it might happen that you need some strange pirq line:
|
||||
|
||||
append="pirq=0,0,0,0,0,0,9,11"
|
||||
|
||||
Use smart trial-and-error techniques to find out the correct pirq line ...
|
||||
|
||||
Good luck and mail to linux-smp@vger.kernel.org or
|
||||
linux-kernel@vger.kernel.org if you have any problems that are not covered
|
||||
by this document.
|
||||
|
||||
-- mingo
|
||||
|
900
Documentation/x86/i386/boot.txt
Normal file
900
Documentation/x86/i386/boot.txt
Normal file
@@ -0,0 +1,900 @@
|
||||
THE LINUX/x86 BOOT PROTOCOL
|
||||
---------------------------
|
||||
|
||||
On the x86 platform, the Linux kernel uses a rather complicated boot
|
||||
convention. This has evolved partially due to historical aspects, as
|
||||
well as the desire in the early days to have the kernel itself be a
|
||||
bootable image, the complicated PC memory model and due to changed
|
||||
expectations in the PC industry caused by the effective demise of
|
||||
real-mode DOS as a mainstream operating system.
|
||||
|
||||
Currently, the following versions of the Linux/x86 boot protocol exist.
|
||||
|
||||
Old kernels: zImage/Image support only. Some very early kernels
|
||||
may not even support a command line.
|
||||
|
||||
Protocol 2.00: (Kernel 1.3.73) Added bzImage and initrd support, as
|
||||
well as a formalized way to communicate between the
|
||||
boot loader and the kernel. setup.S made relocatable,
|
||||
although the traditional setup area still assumed
|
||||
writable.
|
||||
|
||||
Protocol 2.01: (Kernel 1.3.76) Added a heap overrun warning.
|
||||
|
||||
Protocol 2.02: (Kernel 2.4.0-test3-pre3) New command line protocol.
|
||||
Lower the conventional memory ceiling. No overwrite
|
||||
of the traditional setup area, thus making booting
|
||||
safe for systems which use the EBDA from SMM or 32-bit
|
||||
BIOS entry points. zImage deprecated but still
|
||||
supported.
|
||||
|
||||
Protocol 2.03: (Kernel 2.4.18-pre1) Explicitly makes the highest possible
|
||||
initrd address available to the bootloader.
|
||||
|
||||
Protocol 2.04: (Kernel 2.6.14) Extend the syssize field to four bytes.
|
||||
|
||||
Protocol 2.05: (Kernel 2.6.20) Make protected mode kernel relocatable.
|
||||
Introduce relocatable_kernel and kernel_alignment fields.
|
||||
|
||||
Protocol 2.06: (Kernel 2.6.22) Added a field that contains the size of
|
||||
the boot command line.
|
||||
|
||||
Protocol 2.07: (Kernel 2.6.24) Added paravirtualised boot protocol.
|
||||
Introduced hardware_subarch and hardware_subarch_data
|
||||
and KEEP_SEGMENTS flag in load_flags.
|
||||
|
||||
Protocol 2.08: (Kernel 2.6.26) Added crc32 checksum and ELF format
|
||||
payload. Introduced payload_offset and payload length
|
||||
fields to aid in locating the payload.
|
||||
|
||||
Protocol 2.09: (Kernel 2.6.26) Added a field of 64-bit physical
|
||||
pointer to single linked list of struct setup_data.
|
||||
|
||||
**** MEMORY LAYOUT
|
||||
|
||||
The traditional memory map for the kernel loader, used for Image or
|
||||
zImage kernels, typically looks like:
|
||||
|
||||
| |
|
||||
0A0000 +------------------------+
|
||||
| Reserved for BIOS | Do not use. Reserved for BIOS EBDA.
|
||||
09A000 +------------------------+
|
||||
| Command line |
|
||||
| Stack/heap | For use by the kernel real-mode code.
|
||||
098000 +------------------------+
|
||||
| Kernel setup | The kernel real-mode code.
|
||||
090200 +------------------------+
|
||||
| Kernel boot sector | The kernel legacy boot sector.
|
||||
090000 +------------------------+
|
||||
| Protected-mode kernel | The bulk of the kernel image.
|
||||
010000 +------------------------+
|
||||
| Boot loader | <- Boot sector entry point 0000:7C00
|
||||
001000 +------------------------+
|
||||
| Reserved for MBR/BIOS |
|
||||
000800 +------------------------+
|
||||
| Typically used by MBR |
|
||||
000600 +------------------------+
|
||||
| BIOS use only |
|
||||
000000 +------------------------+
|
||||
|
||||
|
||||
When using bzImage, the protected-mode kernel was relocated to
|
||||
0x100000 ("high memory"), and the kernel real-mode block (boot sector,
|
||||
setup, and stack/heap) was made relocatable to any address between
|
||||
0x10000 and end of low memory. Unfortunately, in protocols 2.00 and
|
||||
2.01 the 0x90000+ memory range is still used internally by the kernel;
|
||||
the 2.02 protocol resolves that problem.
|
||||
|
||||
It is desirable to keep the "memory ceiling" -- the highest point in
|
||||
low memory touched by the boot loader -- as low as possible, since
|
||||
some newer BIOSes have begun to allocate some rather large amounts of
|
||||
memory, called the Extended BIOS Data Area, near the top of low
|
||||
memory. The boot loader should use the "INT 12h" BIOS call to verify
|
||||
how much low memory is available.
|
||||
|
||||
Unfortunately, if INT 12h reports that the amount of memory is too
|
||||
low, there is usually nothing the boot loader can do but to report an
|
||||
error to the user. The boot loader should therefore be designed to
|
||||
take up as little space in low memory as it reasonably can. For
|
||||
zImage or old bzImage kernels, which need data written into the
|
||||
0x90000 segment, the boot loader should make sure not to use memory
|
||||
above the 0x9A000 point; too many BIOSes will break above that point.
|
||||
|
||||
For a modern bzImage kernel with boot protocol version >= 2.02, a
|
||||
memory layout like the following is suggested:
|
||||
|
||||
~ ~
|
||||
| Protected-mode kernel |
|
||||
100000 +------------------------+
|
||||
| I/O memory hole |
|
||||
0A0000 +------------------------+
|
||||
| Reserved for BIOS | Leave as much as possible unused
|
||||
~ ~
|
||||
| Command line | (Can also be below the X+10000 mark)
|
||||
X+10000 +------------------------+
|
||||
| Stack/heap | For use by the kernel real-mode code.
|
||||
X+08000 +------------------------+
|
||||
| Kernel setup | The kernel real-mode code.
|
||||
| Kernel boot sector | The kernel legacy boot sector.
|
||||
X +------------------------+
|
||||
| Boot loader | <- Boot sector entry point 0000:7C00
|
||||
001000 +------------------------+
|
||||
| Reserved for MBR/BIOS |
|
||||
000800 +------------------------+
|
||||
| Typically used by MBR |
|
||||
000600 +------------------------+
|
||||
| BIOS use only |
|
||||
000000 +------------------------+
|
||||
|
||||
... where the address X is as low as the design of the boot loader
|
||||
permits.
|
||||
|
||||
|
||||
**** THE REAL-MODE KERNEL HEADER
|
||||
|
||||
In the following text, and anywhere in the kernel boot sequence, "a
|
||||
sector" refers to 512 bytes. It is independent of the actual sector
|
||||
size of the underlying medium.
|
||||
|
||||
The first step in loading a Linux kernel should be to load the
|
||||
real-mode code (boot sector and setup code) and then examine the
|
||||
following header at offset 0x01f1. The real-mode code can total up to
|
||||
32K, although the boot loader may choose to load only the first two
|
||||
sectors (1K) and then examine the bootup sector size.
|
||||
|
||||
The header looks like:
|
||||
|
||||
Offset Proto Name Meaning
|
||||
/Size
|
||||
|
||||
01F1/1 ALL(1 setup_sects The size of the setup in sectors
|
||||
01F2/2 ALL root_flags If set, the root is mounted readonly
|
||||
01F4/4 2.04+(2 syssize The size of the 32-bit code in 16-byte paras
|
||||
01F8/2 ALL ram_size DO NOT USE - for bootsect.S use only
|
||||
01FA/2 ALL vid_mode Video mode control
|
||||
01FC/2 ALL root_dev Default root device number
|
||||
01FE/2 ALL boot_flag 0xAA55 magic number
|
||||
0200/2 2.00+ jump Jump instruction
|
||||
0202/4 2.00+ header Magic signature "HdrS"
|
||||
0206/2 2.00+ version Boot protocol version supported
|
||||
0208/4 2.00+ realmode_swtch Boot loader hook (see below)
|
||||
020C/2 2.00+ start_sys The load-low segment (0x1000) (obsolete)
|
||||
020E/2 2.00+ kernel_version Pointer to kernel version string
|
||||
0210/1 2.00+ type_of_loader Boot loader identifier
|
||||
0211/1 2.00+ loadflags Boot protocol option flags
|
||||
0212/2 2.00+ setup_move_size Move to high memory size (used with hooks)
|
||||
0214/4 2.00+ code32_start Boot loader hook (see below)
|
||||
0218/4 2.00+ ramdisk_image initrd load address (set by boot loader)
|
||||
021C/4 2.00+ ramdisk_size initrd size (set by boot loader)
|
||||
0220/4 2.00+ bootsect_kludge DO NOT USE - for bootsect.S use only
|
||||
0224/2 2.01+ heap_end_ptr Free memory after setup end
|
||||
0226/2 N/A pad1 Unused
|
||||
0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line
|
||||
022C/4 2.03+ initrd_addr_max Highest legal initrd address
|
||||
0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel
|
||||
0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not
|
||||
0235/3 N/A pad2 Unused
|
||||
0238/4 2.06+ cmdline_size Maximum size of the kernel command line
|
||||
023C/4 2.07+ hardware_subarch Hardware subarchitecture
|
||||
0240/8 2.07+ hardware_subarch_data Subarchitecture-specific data
|
||||
0248/4 2.08+ payload_offset Offset of kernel payload
|
||||
024C/4 2.08+ payload_length Length of kernel payload
|
||||
0250/8 2.09+ setup_data 64-bit physical pointer to linked list
|
||||
of struct setup_data
|
||||
|
||||
(1) For backwards compatibility, if the setup_sects field contains 0, the
|
||||
real value is 4.
|
||||
|
||||
(2) For boot protocol prior to 2.04, the upper two bytes of the syssize
|
||||
field are unusable, which means the size of a bzImage kernel
|
||||
cannot be determined.
|
||||
|
||||
If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
|
||||
the boot protocol version is "old". Loading an old kernel, the
|
||||
following parameters should be assumed:
|
||||
|
||||
Image type = zImage
|
||||
initrd not supported
|
||||
Real-mode kernel must be located at 0x90000.
|
||||
|
||||
Otherwise, the "version" field contains the protocol version,
|
||||
e.g. protocol version 2.01 will contain 0x0201 in this field. When
|
||||
setting fields in the header, you must make sure only to set fields
|
||||
supported by the protocol version in use.
|
||||
|
||||
|
||||
**** DETAILS OF HEADER FIELDS
|
||||
|
||||
For each field, some are information from the kernel to the bootloader
|
||||
("read"), some are expected to be filled out by the bootloader
|
||||
("write"), and some are expected to be read and modified by the
|
||||
bootloader ("modify").
|
||||
|
||||
All general purpose boot loaders should write the fields marked
|
||||
(obligatory). Boot loaders who want to load the kernel at a
|
||||
nonstandard address should fill in the fields marked (reloc); other
|
||||
boot loaders can ignore those fields.
|
||||
|
||||
The byte order of all fields is littleendian (this is x86, after all.)
|
||||
|
||||
Field name: setup_sects
|
||||
Type: read
|
||||
Offset/size: 0x1f1/1
|
||||
Protocol: ALL
|
||||
|
||||
The size of the setup code in 512-byte sectors. If this field is
|
||||
0, the real value is 4. The real-mode code consists of the boot
|
||||
sector (always one 512-byte sector) plus the setup code.
|
||||
|
||||
Field name: root_flags
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x1f2/2
|
||||
Protocol: ALL
|
||||
|
||||
If this field is nonzero, the root defaults to readonly. The use of
|
||||
this field is deprecated; use the "ro" or "rw" options on the
|
||||
command line instead.
|
||||
|
||||
Field name: syssize
|
||||
Type: read
|
||||
Offset/size: 0x1f4/4 (protocol 2.04+) 0x1f4/2 (protocol ALL)
|
||||
Protocol: 2.04+
|
||||
|
||||
The size of the protected-mode code in units of 16-byte paragraphs.
|
||||
For protocol versions older than 2.04 this field is only two bytes
|
||||
wide, and therefore cannot be trusted for the size of a kernel if
|
||||
the LOAD_HIGH flag is set.
|
||||
|
||||
Field name: ram_size
|
||||
Type: kernel internal
|
||||
Offset/size: 0x1f8/2
|
||||
Protocol: ALL
|
||||
|
||||
This field is obsolete.
|
||||
|
||||
Field name: vid_mode
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x1fa/2
|
||||
|
||||
Please see the section on SPECIAL COMMAND LINE OPTIONS.
|
||||
|
||||
Field name: root_dev
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x1fc/2
|
||||
Protocol: ALL
|
||||
|
||||
The default root device device number. The use of this field is
|
||||
deprecated, use the "root=" option on the command line instead.
|
||||
|
||||
Field name: boot_flag
|
||||
Type: read
|
||||
Offset/size: 0x1fe/2
|
||||
Protocol: ALL
|
||||
|
||||
Contains 0xAA55. This is the closest thing old Linux kernels have
|
||||
to a magic number.
|
||||
|
||||
Field name: jump
|
||||
Type: read
|
||||
Offset/size: 0x200/2
|
||||
Protocol: 2.00+
|
||||
|
||||
Contains an x86 jump instruction, 0xEB followed by a signed offset
|
||||
relative to byte 0x202. This can be used to determine the size of
|
||||
the header.
|
||||
|
||||
Field name: header
|
||||
Type: read
|
||||
Offset/size: 0x202/4
|
||||
Protocol: 2.00+
|
||||
|
||||
Contains the magic number "HdrS" (0x53726448).
|
||||
|
||||
Field name: version
|
||||
Type: read
|
||||
Offset/size: 0x206/2
|
||||
Protocol: 2.00+
|
||||
|
||||
Contains the boot protocol version, in (major << 8)+minor format,
|
||||
e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version
|
||||
10.17.
|
||||
|
||||
Field name: readmode_swtch
|
||||
Type: modify (optional)
|
||||
Offset/size: 0x208/4
|
||||
Protocol: 2.00+
|
||||
|
||||
Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
|
||||
|
||||
Field name: start_sys
|
||||
Type: read
|
||||
Offset/size: 0x20c/4
|
||||
Protocol: 2.00+
|
||||
|
||||
The load low segment (0x1000). Obsolete.
|
||||
|
||||
Field name: kernel_version
|
||||
Type: read
|
||||
Offset/size: 0x20e/2
|
||||
Protocol: 2.00+
|
||||
|
||||
If set to a nonzero value, contains a pointer to a NUL-terminated
|
||||
human-readable kernel version number string, less 0x200. This can
|
||||
be used to display the kernel version to the user. This value
|
||||
should be less than (0x200*setup_sects).
|
||||
|
||||
For example, if this value is set to 0x1c00, the kernel version
|
||||
number string can be found at offset 0x1e00 in the kernel file.
|
||||
This is a valid value if and only if the "setup_sects" field
|
||||
contains the value 15 or higher, as:
|
||||
|
||||
0x1c00 < 15*0x200 (= 0x1e00) but
|
||||
0x1c00 >= 14*0x200 (= 0x1c00)
|
||||
|
||||
0x1c00 >> 9 = 14, so the minimum value for setup_secs is 15.
|
||||
|
||||
Field name: type_of_loader
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x210/1
|
||||
Protocol: 2.00+
|
||||
|
||||
If your boot loader has an assigned id (see table below), enter
|
||||
0xTV here, where T is an identifier for the boot loader and V is
|
||||
a version number. Otherwise, enter 0xFF here.
|
||||
|
||||
Assigned boot loader ids:
|
||||
0 LILO (0x00 reserved for pre-2.00 bootloader)
|
||||
1 Loadlin
|
||||
2 bootsect-loader (0x20, all other values reserved)
|
||||
3 SYSLINUX
|
||||
4 EtherBoot
|
||||
5 ELILO
|
||||
7 GRuB
|
||||
8 U-BOOT
|
||||
9 Xen
|
||||
A Gujin
|
||||
B Qemu
|
||||
|
||||
Please contact <hpa@zytor.com> if you need a bootloader ID
|
||||
value assigned.
|
||||
|
||||
Field name: loadflags
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x211/1
|
||||
Protocol: 2.00+
|
||||
|
||||
This field is a bitmask.
|
||||
|
||||
Bit 0 (read): LOADED_HIGH
|
||||
- If 0, the protected-mode code is loaded at 0x10000.
|
||||
- If 1, the protected-mode code is loaded at 0x100000.
|
||||
|
||||
Bit 5 (write): QUIET_FLAG
|
||||
- If 0, print early messages.
|
||||
- If 1, suppress early messages.
|
||||
This requests to the kernel (decompressor and early
|
||||
kernel) to not write early messages that require
|
||||
accessing the display hardware directly.
|
||||
|
||||
Bit 6 (write): KEEP_SEGMENTS
|
||||
Protocol: 2.07+
|
||||
- If 0, reload the segment registers in the 32bit entry point.
|
||||
- If 1, do not reload the segment registers in the 32bit entry point.
|
||||
Assume that %cs %ds %ss %es are all set to flat segments with
|
||||
a base of 0 (or the equivalent for their environment).
|
||||
|
||||
Bit 7 (write): CAN_USE_HEAP
|
||||
Set this bit to 1 to indicate that the value entered in the
|
||||
heap_end_ptr is valid. If this field is clear, some setup code
|
||||
functionality will be disabled.
|
||||
|
||||
Field name: setup_move_size
|
||||
Type: modify (obligatory)
|
||||
Offset/size: 0x212/2
|
||||
Protocol: 2.00-2.01
|
||||
|
||||
When using protocol 2.00 or 2.01, if the real mode kernel is not
|
||||
loaded at 0x90000, it gets moved there later in the loading
|
||||
sequence. Fill in this field if you want additional data (such as
|
||||
the kernel command line) moved in addition to the real-mode kernel
|
||||
itself.
|
||||
|
||||
The unit is bytes starting with the beginning of the boot sector.
|
||||
|
||||
This field is can be ignored when the protocol is 2.02 or higher, or
|
||||
if the real-mode code is loaded at 0x90000.
|
||||
|
||||
Field name: code32_start
|
||||
Type: modify (optional, reloc)
|
||||
Offset/size: 0x214/4
|
||||
Protocol: 2.00+
|
||||
|
||||
The address to jump to in protected mode. This defaults to the load
|
||||
address of the kernel, and can be used by the boot loader to
|
||||
determine the proper load address.
|
||||
|
||||
This field can be modified for two purposes:
|
||||
|
||||
1. as a boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
|
||||
|
||||
2. if a bootloader which does not install a hook loads a
|
||||
relocatable kernel at a nonstandard address it will have to modify
|
||||
this field to point to the load address.
|
||||
|
||||
Field name: ramdisk_image
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x218/4
|
||||
Protocol: 2.00+
|
||||
|
||||
The 32-bit linear address of the initial ramdisk or ramfs. Leave at
|
||||
zero if there is no initial ramdisk/ramfs.
|
||||
|
||||
Field name: ramdisk_size
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x21c/4
|
||||
Protocol: 2.00+
|
||||
|
||||
Size of the initial ramdisk or ramfs. Leave at zero if there is no
|
||||
initial ramdisk/ramfs.
|
||||
|
||||
Field name: bootsect_kludge
|
||||
Type: kernel internal
|
||||
Offset/size: 0x220/4
|
||||
Protocol: 2.00+
|
||||
|
||||
This field is obsolete.
|
||||
|
||||
Field name: heap_end_ptr
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x224/2
|
||||
Protocol: 2.01+
|
||||
|
||||
Set this field to the offset (from the beginning of the real-mode
|
||||
code) of the end of the setup stack/heap, minus 0x0200.
|
||||
|
||||
Field name: cmd_line_ptr
|
||||
Type: write (obligatory)
|
||||
Offset/size: 0x228/4
|
||||
Protocol: 2.02+
|
||||
|
||||
Set this field to the linear address of the kernel command line.
|
||||
The kernel command line can be located anywhere between the end of
|
||||
the setup heap and 0xA0000; it does not have to be located in the
|
||||
same 64K segment as the real-mode code itself.
|
||||
|
||||
Fill in this field even if your boot loader does not support a
|
||||
command line, in which case you can point this to an empty string
|
||||
(or better yet, to the string "auto".) If this field is left at
|
||||
zero, the kernel will assume that your boot loader does not support
|
||||
the 2.02+ protocol.
|
||||
|
||||
Field name: initrd_addr_max
|
||||
Type: read
|
||||
Offset/size: 0x22c/4
|
||||
Protocol: 2.03+
|
||||
|
||||
The maximum address that may be occupied by the initial
|
||||
ramdisk/ramfs contents. For boot protocols 2.02 or earlier, this
|
||||
field is not present, and the maximum address is 0x37FFFFFF. (This
|
||||
address is defined as the address of the highest safe byte, so if
|
||||
your ramdisk is exactly 131072 bytes long and this field is
|
||||
0x37FFFFFF, you can start your ramdisk at 0x37FE0000.)
|
||||
|
||||
Field name: kernel_alignment
|
||||
Type: read (reloc)
|
||||
Offset/size: 0x230/4
|
||||
Protocol: 2.05+
|
||||
|
||||
Alignment unit required by the kernel (if relocatable_kernel is true.)
|
||||
|
||||
Field name: relocatable_kernel
|
||||
Type: read (reloc)
|
||||
Offset/size: 0x234/1
|
||||
Protocol: 2.05+
|
||||
|
||||
If this field is nonzero, the protected-mode part of the kernel can
|
||||
be loaded at any address that satisfies the kernel_alignment field.
|
||||
After loading, the boot loader must set the code32_start field to
|
||||
point to the loaded code, or to a boot loader hook.
|
||||
|
||||
Field name: cmdline_size
|
||||
Type: read
|
||||
Offset/size: 0x238/4
|
||||
Protocol: 2.06+
|
||||
|
||||
The maximum size of the command line without the terminating
|
||||
zero. This means that the command line can contain at most
|
||||
cmdline_size characters. With protocol version 2.05 and earlier, the
|
||||
maximum size was 255.
|
||||
|
||||
Field name: hardware_subarch
|
||||
Type: write (optional, defaults to x86/PC)
|
||||
Offset/size: 0x23c/4
|
||||
Protocol: 2.07+
|
||||
|
||||
In a paravirtualized environment the hardware low level architectural
|
||||
pieces such as interrupt handling, page table handling, and
|
||||
accessing process control registers needs to be done differently.
|
||||
|
||||
This field allows the bootloader to inform the kernel we are in one
|
||||
one of those environments.
|
||||
|
||||
0x00000000 The default x86/PC environment
|
||||
0x00000001 lguest
|
||||
0x00000002 Xen
|
||||
|
||||
Field name: hardware_subarch_data
|
||||
Type: write (subarch-dependent)
|
||||
Offset/size: 0x240/8
|
||||
Protocol: 2.07+
|
||||
|
||||
A pointer to data that is specific to hardware subarch
|
||||
This field is currently unused for the default x86/PC environment,
|
||||
do not modify.
|
||||
|
||||
Field name: payload_offset
|
||||
Type: read
|
||||
Offset/size: 0x248/4
|
||||
Protocol: 2.08+
|
||||
|
||||
If non-zero then this field contains the offset from the end of the
|
||||
real-mode code to the payload.
|
||||
|
||||
The payload may be compressed. The format of both the compressed and
|
||||
uncompressed data should be determined using the standard magic
|
||||
numbers. Currently only gzip compressed ELF is used.
|
||||
|
||||
Field name: payload_length
|
||||
Type: read
|
||||
Offset/size: 0x24c/4
|
||||
Protocol: 2.08+
|
||||
|
||||
The length of the payload.
|
||||
|
||||
Field name: setup_data
|
||||
Type: write (special)
|
||||
Offset/size: 0x250/8
|
||||
Protocol: 2.09+
|
||||
|
||||
The 64-bit physical pointer to NULL terminated single linked list of
|
||||
struct setup_data. This is used to define a more extensible boot
|
||||
parameters passing mechanism. The definition of struct setup_data is
|
||||
as follow:
|
||||
|
||||
struct setup_data {
|
||||
u64 next;
|
||||
u32 type;
|
||||
u32 len;
|
||||
u8 data[0];
|
||||
};
|
||||
|
||||
Where, the next is a 64-bit physical pointer to the next node of
|
||||
linked list, the next field of the last node is 0; the type is used
|
||||
to identify the contents of data; the len is the length of data
|
||||
field; the data holds the real payload.
|
||||
|
||||
This list may be modified at a number of points during the bootup
|
||||
process. Therefore, when modifying this list one should always make
|
||||
sure to consider the case where the linked list already contains
|
||||
entries.
|
||||
|
||||
|
||||
**** THE IMAGE CHECKSUM
|
||||
|
||||
From boot protocol version 2.08 onwards the CRC-32 is calculated over
|
||||
the entire file using the characteristic polynomial 0x04C11DB7 and an
|
||||
initial remainder of 0xffffffff. The checksum is appended to the
|
||||
file; therefore the CRC of the file up to the limit specified in the
|
||||
syssize field of the header is always 0.
|
||||
|
||||
|
||||
**** THE KERNEL COMMAND LINE
|
||||
|
||||
The kernel command line has become an important way for the boot
|
||||
loader to communicate with the kernel. Some of its options are also
|
||||
relevant to the boot loader itself, see "special command line options"
|
||||
below.
|
||||
|
||||
The kernel command line is a null-terminated string. The maximum
|
||||
length can be retrieved from the field cmdline_size. Before protocol
|
||||
version 2.06, the maximum was 255 characters. A string that is too
|
||||
long will be automatically truncated by the kernel.
|
||||
|
||||
If the boot protocol version is 2.02 or later, the address of the
|
||||
kernel command line is given by the header field cmd_line_ptr (see
|
||||
above.) This address can be anywhere between the end of the setup
|
||||
heap and 0xA0000.
|
||||
|
||||
If the protocol version is *not* 2.02 or higher, the kernel
|
||||
command line is entered using the following protocol:
|
||||
|
||||
At offset 0x0020 (word), "cmd_line_magic", enter the magic
|
||||
number 0xA33F.
|
||||
|
||||
At offset 0x0022 (word), "cmd_line_offset", enter the offset
|
||||
of the kernel command line (relative to the start of the
|
||||
real-mode kernel).
|
||||
|
||||
The kernel command line *must* be within the memory region
|
||||
covered by setup_move_size, so you may need to adjust this
|
||||
field.
|
||||
|
||||
|
||||
**** MEMORY LAYOUT OF THE REAL-MODE CODE
|
||||
|
||||
The real-mode code requires a stack/heap to be set up, as well as
|
||||
memory allocated for the kernel command line. This needs to be done
|
||||
in the real-mode accessible memory in bottom megabyte.
|
||||
|
||||
It should be noted that modern machines often have a sizable Extended
|
||||
BIOS Data Area (EBDA). As a result, it is advisable to use as little
|
||||
of the low megabyte as possible.
|
||||
|
||||
Unfortunately, under the following circumstances the 0x90000 memory
|
||||
segment has to be used:
|
||||
|
||||
- When loading a zImage kernel ((loadflags & 0x01) == 0).
|
||||
- When loading a 2.01 or earlier boot protocol kernel.
|
||||
|
||||
-> For the 2.00 and 2.01 boot protocols, the real-mode code
|
||||
can be loaded at another address, but it is internally
|
||||
relocated to 0x90000. For the "old" protocol, the
|
||||
real-mode code must be loaded at 0x90000.
|
||||
|
||||
When loading at 0x90000, avoid using memory above 0x9a000.
|
||||
|
||||
For boot protocol 2.02 or higher, the command line does not have to be
|
||||
located in the same 64K segment as the real-mode setup code; it is
|
||||
thus permitted to give the stack/heap the full 64K segment and locate
|
||||
the command line above it.
|
||||
|
||||
The kernel command line should not be located below the real-mode
|
||||
code, nor should it be located in high memory.
|
||||
|
||||
|
||||
**** SAMPLE BOOT CONFIGURATION
|
||||
|
||||
As a sample configuration, assume the following layout of the real
|
||||
mode segment:
|
||||
|
||||
When loading below 0x90000, use the entire segment:
|
||||
|
||||
0x0000-0x7fff Real mode kernel
|
||||
0x8000-0xdfff Stack and heap
|
||||
0xe000-0xffff Kernel command line
|
||||
|
||||
When loading at 0x90000 OR the protocol version is 2.01 or earlier:
|
||||
|
||||
0x0000-0x7fff Real mode kernel
|
||||
0x8000-0x97ff Stack and heap
|
||||
0x9800-0x9fff Kernel command line
|
||||
|
||||
Such a boot loader should enter the following fields in the header:
|
||||
|
||||
unsigned long base_ptr; /* base address for real-mode segment */
|
||||
|
||||
if ( setup_sects == 0 ) {
|
||||
setup_sects = 4;
|
||||
}
|
||||
|
||||
if ( protocol >= 0x0200 ) {
|
||||
type_of_loader = <type code>;
|
||||
if ( loading_initrd ) {
|
||||
ramdisk_image = <initrd_address>;
|
||||
ramdisk_size = <initrd_size>;
|
||||
}
|
||||
|
||||
if ( protocol >= 0x0202 && loadflags & 0x01 )
|
||||
heap_end = 0xe000;
|
||||
else
|
||||
heap_end = 0x9800;
|
||||
|
||||
if ( protocol >= 0x0201 ) {
|
||||
heap_end_ptr = heap_end - 0x200;
|
||||
loadflags |= 0x80; /* CAN_USE_HEAP */
|
||||
}
|
||||
|
||||
if ( protocol >= 0x0202 ) {
|
||||
cmd_line_ptr = base_ptr + heap_end;
|
||||
strcpy(cmd_line_ptr, cmdline);
|
||||
} else {
|
||||
cmd_line_magic = 0xA33F;
|
||||
cmd_line_offset = heap_end;
|
||||
setup_move_size = heap_end + strlen(cmdline)+1;
|
||||
strcpy(base_ptr+cmd_line_offset, cmdline);
|
||||
}
|
||||
} else {
|
||||
/* Very old kernel */
|
||||
|
||||
heap_end = 0x9800;
|
||||
|
||||
cmd_line_magic = 0xA33F;
|
||||
cmd_line_offset = heap_end;
|
||||
|
||||
/* A very old kernel MUST have its real-mode code
|
||||
loaded at 0x90000 */
|
||||
|
||||
if ( base_ptr != 0x90000 ) {
|
||||
/* Copy the real-mode kernel */
|
||||
memcpy(0x90000, base_ptr, (setup_sects+1)*512);
|
||||
base_ptr = 0x90000; /* Relocated */
|
||||
}
|
||||
|
||||
strcpy(0x90000+cmd_line_offset, cmdline);
|
||||
|
||||
/* It is recommended to clear memory up to the 32K mark */
|
||||
memset(0x90000 + (setup_sects+1)*512, 0,
|
||||
(64-(setup_sects+1))*512);
|
||||
}
|
||||
|
||||
|
||||
**** LOADING THE REST OF THE KERNEL
|
||||
|
||||
The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
|
||||
in the kernel file (again, if setup_sects == 0 the real value is 4.)
|
||||
It should be loaded at address 0x10000 for Image/zImage kernels and
|
||||
0x100000 for bzImage kernels.
|
||||
|
||||
The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01
|
||||
bit (LOAD_HIGH) in the loadflags field is set:
|
||||
|
||||
is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
|
||||
load_address = is_bzImage ? 0x100000 : 0x10000;
|
||||
|
||||
Note that Image/zImage kernels can be up to 512K in size, and thus use
|
||||
the entire 0x10000-0x90000 range of memory. This means it is pretty
|
||||
much a requirement for these kernels to load the real-mode part at
|
||||
0x90000. bzImage kernels allow much more flexibility.
|
||||
|
||||
|
||||
**** SPECIAL COMMAND LINE OPTIONS
|
||||
|
||||
If the command line provided by the boot loader is entered by the
|
||||
user, the user may expect the following command line options to work.
|
||||
They should normally not be deleted from the kernel command line even
|
||||
though not all of them are actually meaningful to the kernel. Boot
|
||||
loader authors who need additional command line options for the boot
|
||||
loader itself should get them registered in
|
||||
Documentation/kernel-parameters.txt to make sure they will not
|
||||
conflict with actual kernel options now or in the future.
|
||||
|
||||
vga=<mode>
|
||||
<mode> here is either an integer (in C notation, either
|
||||
decimal, octal, or hexadecimal) or one of the strings
|
||||
"normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
|
||||
(meaning 0xFFFD). This value should be entered into the
|
||||
vid_mode field, as it is used by the kernel before the command
|
||||
line is parsed.
|
||||
|
||||
mem=<size>
|
||||
<size> is an integer in C notation optionally followed by
|
||||
(case insensitive) K, M, G, T, P or E (meaning << 10, << 20,
|
||||
<< 30, << 40, << 50 or << 60). This specifies the end of
|
||||
memory to the kernel. This affects the possible placement of
|
||||
an initrd, since an initrd should be placed near end of
|
||||
memory. Note that this is an option to *both* the kernel and
|
||||
the bootloader!
|
||||
|
||||
initrd=<file>
|
||||
An initrd should be loaded. The meaning of <file> is
|
||||
obviously bootloader-dependent, and some boot loaders
|
||||
(e.g. LILO) do not have such a command.
|
||||
|
||||
In addition, some boot loaders add the following options to the
|
||||
user-specified command line:
|
||||
|
||||
BOOT_IMAGE=<file>
|
||||
The boot image which was loaded. Again, the meaning of <file>
|
||||
is obviously bootloader-dependent.
|
||||
|
||||
auto
|
||||
The kernel was booted without explicit user intervention.
|
||||
|
||||
If these options are added by the boot loader, it is highly
|
||||
recommended that they are located *first*, before the user-specified
|
||||
or configuration-specified command line. Otherwise, "init=/bin/sh"
|
||||
gets confused by the "auto" option.
|
||||
|
||||
|
||||
**** RUNNING THE KERNEL
|
||||
|
||||
The kernel is started by jumping to the kernel entry point, which is
|
||||
located at *segment* offset 0x20 from the start of the real mode
|
||||
kernel. This means that if you loaded your real-mode kernel code at
|
||||
0x90000, the kernel entry point is 9020:0000.
|
||||
|
||||
At entry, ds = es = ss should point to the start of the real-mode
|
||||
kernel code (0x9000 if the code is loaded at 0x90000), sp should be
|
||||
set up properly, normally pointing to the top of the heap, and
|
||||
interrupts should be disabled. Furthermore, to guard against bugs in
|
||||
the kernel, it is recommended that the boot loader sets fs = gs = ds =
|
||||
es = ss.
|
||||
|
||||
In our example from above, we would do:
|
||||
|
||||
/* Note: in the case of the "old" kernel protocol, base_ptr must
|
||||
be == 0x90000 at this point; see the previous sample code */
|
||||
|
||||
seg = base_ptr >> 4;
|
||||
|
||||
cli(); /* Enter with interrupts disabled! */
|
||||
|
||||
/* Set up the real-mode kernel stack */
|
||||
_SS = seg;
|
||||
_SP = heap_end;
|
||||
|
||||
_DS = _ES = _FS = _GS = seg;
|
||||
jmp_far(seg+0x20, 0); /* Run the kernel */
|
||||
|
||||
If your boot sector accesses a floppy drive, it is recommended to
|
||||
switch off the floppy motor before running the kernel, since the
|
||||
kernel boot leaves interrupts off and thus the motor will not be
|
||||
switched off, especially if the loaded kernel has the floppy driver as
|
||||
a demand-loaded module!
|
||||
|
||||
|
||||
**** ADVANCED BOOT LOADER HOOKS
|
||||
|
||||
If the boot loader runs in a particularly hostile environment (such as
|
||||
LOADLIN, which runs under DOS) it may be impossible to follow the
|
||||
standard memory location requirements. Such a boot loader may use the
|
||||
following hooks that, if set, are invoked by the kernel at the
|
||||
appropriate time. The use of these hooks should probably be
|
||||
considered an absolutely last resort!
|
||||
|
||||
IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and
|
||||
%edi across invocation.
|
||||
|
||||
realmode_swtch:
|
||||
A 16-bit real mode far subroutine invoked immediately before
|
||||
entering protected mode. The default routine disables NMI, so
|
||||
your routine should probably do so, too.
|
||||
|
||||
code32_start:
|
||||
A 32-bit flat-mode routine *jumped* to immediately after the
|
||||
transition to protected mode, but before the kernel is
|
||||
uncompressed. No segments, except CS, are guaranteed to be
|
||||
set up (current kernels do, but older ones do not); you should
|
||||
set them up to BOOT_DS (0x18) yourself.
|
||||
|
||||
After completing your hook, you should jump to the address
|
||||
that was in this field before your boot loader overwrote it
|
||||
(relocated, if appropriate.)
|
||||
|
||||
|
||||
**** 32-bit BOOT PROTOCOL
|
||||
|
||||
For machine with some new BIOS other than legacy BIOS, such as EFI,
|
||||
LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel
|
||||
based on legacy BIOS can not be used, so a 32-bit boot protocol needs
|
||||
to be defined.
|
||||
|
||||
In 32-bit boot protocol, the first step in loading a Linux kernel
|
||||
should be to setup the boot parameters (struct boot_params,
|
||||
traditionally known as "zero page"). The memory for struct boot_params
|
||||
should be allocated and initialized to all zero. Then the setup header
|
||||
from offset 0x01f1 of kernel image on should be loaded into struct
|
||||
boot_params and examined. The end of setup header can be calculated as
|
||||
follow:
|
||||
|
||||
0x0202 + byte value at offset 0x0201
|
||||
|
||||
In addition to read/modify/write the setup header of the struct
|
||||
boot_params as that of 16-bit boot protocol, the boot loader should
|
||||
also fill the additional fields of the struct boot_params as that
|
||||
described in zero-page.txt.
|
||||
|
||||
After setupping the struct boot_params, the boot loader can load the
|
||||
32/64-bit kernel in the same way as that of 16-bit boot protocol.
|
||||
|
||||
In 32-bit boot protocol, the kernel is started by jumping to the
|
||||
32-bit kernel entry point, which is the start address of loaded
|
||||
32/64-bit kernel.
|
||||
|
||||
At entry, the CPU must be in 32-bit protected mode with paging
|
||||
disabled; a GDT must be loaded with the descriptors for selectors
|
||||
__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
|
||||
segment; __BOOS_CS must have execute/read permission, and __BOOT_DS
|
||||
must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
|
||||
must be __BOOT_DS; interrupt must be disabled; %esi must hold the base
|
||||
address of the struct boot_params; %ebp, %edi and %ebx must be zero.
|
44
Documentation/x86/i386/usb-legacy-support.txt
Normal file
44
Documentation/x86/i386/usb-legacy-support.txt
Normal file
@@ -0,0 +1,44 @@
|
||||
USB Legacy support
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Vojtech Pavlik <vojtech@suse.cz>, January 2004
|
||||
|
||||
|
||||
Also known as "USB Keyboard" or "USB Mouse support" in the BIOS Setup is a
|
||||
feature that allows one to use the USB mouse and keyboard as if they were
|
||||
their classic PS/2 counterparts. This means one can use an USB keyboard to
|
||||
type in LILO for example.
|
||||
|
||||
It has several drawbacks, though:
|
||||
|
||||
1) On some machines, the emulated PS/2 mouse takes over even when no USB
|
||||
mouse is present and a real PS/2 mouse is present. In that case the extra
|
||||
features (wheel, extra buttons, touchpad mode) of the real PS/2 mouse may
|
||||
not be available.
|
||||
|
||||
2) If CONFIG_HIGHMEM64G is enabled, the PS/2 mouse emulation can cause
|
||||
system crashes, because the SMM BIOS is not expecting to be in PAE mode.
|
||||
The Intel E7505 is a typical machine where this happens.
|
||||
|
||||
3) If AMD64 64-bit mode is enabled, again system crashes often happen,
|
||||
because the SMM BIOS isn't expecting the CPU to be in 64-bit mode. The
|
||||
BIOS manufacturers only test with Windows, and Windows doesn't do 64-bit
|
||||
yet.
|
||||
|
||||
Solutions:
|
||||
|
||||
Problem 1) can be solved by loading the USB drivers prior to loading the
|
||||
PS/2 mouse driver. Since the PS/2 mouse driver is in 2.6 compiled into
|
||||
the kernel unconditionally, this means the USB drivers need to be
|
||||
compiled-in, too.
|
||||
|
||||
Problem 2) can currently only be solved by either disabling HIGHMEM64G
|
||||
in the kernel config or USB Legacy support in the BIOS. A BIOS update
|
||||
could help, but so far no such update exists.
|
||||
|
||||
Problem 3) is usually fixed by a BIOS update. Check the board
|
||||
manufacturers web site. If an update is not available, disable USB
|
||||
Legacy support in the BIOS. If this alone doesn't help, try also adding
|
||||
idle=poll on the kernel command line. The BIOS may be entering the SMM
|
||||
on the HLT instruction as well.
|
||||
|
31
Documentation/x86/i386/zero-page.txt
Normal file
31
Documentation/x86/i386/zero-page.txt
Normal file
@@ -0,0 +1,31 @@
|
||||
The additional fields in struct boot_params as a part of 32-bit boot
|
||||
protocol of kernel. These should be filled by bootloader or 16-bit
|
||||
real-mode setup code of the kernel. References/settings to it mainly
|
||||
are in:
|
||||
|
||||
include/asm-x86/bootparam.h
|
||||
|
||||
|
||||
Offset Proto Name Meaning
|
||||
/Size
|
||||
|
||||
000/040 ALL screen_info Text mode or frame buffer information
|
||||
(struct screen_info)
|
||||
040/014 ALL apm_bios_info APM BIOS information (struct apm_bios_info)
|
||||
060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
|
||||
(struct ist_info)
|
||||
080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!!
|
||||
090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!!
|
||||
0A0/010 ALL sys_desc_table System description table (struct sys_desc_table)
|
||||
140/080 ALL edid_info Video mode setup (struct edid_info)
|
||||
1C0/020 ALL efi_info EFI 32 information (struct efi_info)
|
||||
1E0/004 ALL alk_mem_k Alternative mem check, in KB
|
||||
1E4/004 ALL scratch Scratch field for the kernel setup code
|
||||
1E8/001 ALL e820_entries Number of entries in e820_map (below)
|
||||
1E9/001 ALL eddbuf_entries Number of entries in eddbuf (below)
|
||||
1EA/001 ALL edd_mbr_sig_buf_entries Number of entries in edd_mbr_sig_buffer
|
||||
(below)
|
||||
290/040 ALL edd_mbr_sig_buffer EDD MBR signatures
|
||||
2D0/A00 ALL e820_map E820 memory map table
|
||||
(array of struct e820entry)
|
||||
D00/1EC ALL eddbuf EDD data (array of struct edd_info)
|
16
Documentation/x86/x86_64/00-INDEX
Normal file
16
Documentation/x86/x86_64/00-INDEX
Normal file
@@ -0,0 +1,16 @@
|
||||
00-INDEX
|
||||
- This file
|
||||
boot-options.txt
|
||||
- AMD64-specific boot options.
|
||||
cpu-hotplug-spec
|
||||
- Firmware support for CPU hotplug under Linux/x86-64
|
||||
fake-numa-for-cpusets
|
||||
- Using numa=fake and CPUSets for Resource Management
|
||||
kernel-stacks
|
||||
- Context-specific per-processor interrupt stacks.
|
||||
machinecheck
|
||||
- Configurable sysfs parameters for the x86-64 machine check code.
|
||||
mm.txt
|
||||
- Memory layout of x86-64 (4 level page tables, 46 bits physical).
|
||||
uefi.txt
|
||||
- Booting Linux via Unified Extensible Firmware Interface.
|
314
Documentation/x86/x86_64/boot-options.txt
Normal file
314
Documentation/x86/x86_64/boot-options.txt
Normal file
@@ -0,0 +1,314 @@
|
||||
AMD64 specific boot options
|
||||
|
||||
There are many others (usually documented in driver documentation), but
|
||||
only the AMD64 specific ones are listed here.
|
||||
|
||||
Machine check
|
||||
|
||||
mce=off disable machine check
|
||||
mce=bootlog Enable logging of machine checks left over from booting.
|
||||
Disabled by default on AMD because some BIOS leave bogus ones.
|
||||
If your BIOS doesn't do that it's a good idea to enable though
|
||||
to make sure you log even machine check events that result
|
||||
in a reboot. On Intel systems it is enabled by default.
|
||||
mce=nobootlog
|
||||
Disable boot machine check logging.
|
||||
mce=tolerancelevel (number)
|
||||
0: always panic on uncorrected errors, log corrected errors
|
||||
1: panic or SIGBUS on uncorrected errors, log corrected errors
|
||||
2: SIGBUS or log uncorrected errors, log corrected errors
|
||||
3: never panic or SIGBUS, log all errors (for testing only)
|
||||
Default is 1
|
||||
Can be also set using sysfs which is preferable.
|
||||
|
||||
nomce (for compatibility with i386): same as mce=off
|
||||
|
||||
Everything else is in sysfs now.
|
||||
|
||||
APICs
|
||||
|
||||
apic Use IO-APIC. Default
|
||||
|
||||
noapic Don't use the IO-APIC.
|
||||
|
||||
disableapic Don't use the local APIC
|
||||
|
||||
nolapic Don't use the local APIC (alias for i386 compatibility)
|
||||
|
||||
pirq=... See Documentation/i386/IO-APIC.txt
|
||||
|
||||
noapictimer Don't set up the APIC timer
|
||||
|
||||
no_timer_check Don't check the IO-APIC timer. This can work around
|
||||
problems with incorrect timer initialization on some boards.
|
||||
|
||||
apicmaintimer Run time keeping from the local APIC timer instead
|
||||
of using the PIT/HPET interrupt for this. This is useful
|
||||
when the PIT/HPET interrupts are unreliable.
|
||||
|
||||
noapicmaintimer Don't do time keeping using the APIC timer.
|
||||
Useful when this option was auto selected, but doesn't work.
|
||||
|
||||
apicpmtimer
|
||||
Do APIC timer calibration using the pmtimer. Implies
|
||||
apicmaintimer. Useful when your PIT timer is totally
|
||||
broken.
|
||||
|
||||
disable_8254_timer / enable_8254_timer
|
||||
Enable interrupt 0 timer routing over the 8254 in addition to over
|
||||
the IO-APIC. The kernel tries to set a sensible default.
|
||||
|
||||
Early Console
|
||||
|
||||
syntax: earlyprintk=vga
|
||||
earlyprintk=serial[,ttySn[,baudrate]]
|
||||
|
||||
The early console is useful when the kernel crashes before the
|
||||
normal console is initialized. It is not enabled by
|
||||
default because it has some cosmetic problems.
|
||||
Append ,keep to not disable it when the real console takes over.
|
||||
Only vga or serial at a time, not both.
|
||||
Currently only ttyS0 and ttyS1 are supported.
|
||||
Interaction with the standard serial driver is not very good.
|
||||
The VGA output is eventually overwritten by the real console.
|
||||
|
||||
Timing
|
||||
|
||||
notsc
|
||||
Don't use the CPU time stamp counter to read the wall time.
|
||||
This can be used to work around timing problems on multiprocessor systems
|
||||
with not properly synchronized CPUs.
|
||||
|
||||
report_lost_ticks
|
||||
Report when timer interrupts are lost because some code turned off
|
||||
interrupts for too long.
|
||||
|
||||
nmi_watchdog=NUMBER[,panic]
|
||||
NUMBER can be:
|
||||
0 don't use an NMI watchdog
|
||||
1 use the IO-APIC timer for the NMI watchdog
|
||||
2 use the local APIC for the NMI watchdog using a performance counter. Note
|
||||
This will use one performance counter and the local APIC's performance
|
||||
vector.
|
||||
When panic is specified panic when an NMI watchdog timeout occurs.
|
||||
This is useful when you use a panic=... timeout and need the box
|
||||
quickly up again.
|
||||
|
||||
nohpet
|
||||
Don't use the HPET timer.
|
||||
|
||||
Idle loop
|
||||
|
||||
idle=poll
|
||||
Don't do power saving in the idle loop using HLT, but poll for rescheduling
|
||||
event. This will make the CPUs eat a lot more power, but may be useful
|
||||
to get slightly better performance in multiprocessor benchmarks. It also
|
||||
makes some profiling using performance counters more accurate.
|
||||
Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
|
||||
CPUs) this option has no performance advantage over the normal idle loop.
|
||||
It may also interact badly with hyperthreading.
|
||||
|
||||
Rebooting
|
||||
|
||||
reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old]
|
||||
bios Use the CPU reboot vector for warm reset
|
||||
warm Don't set the cold reboot flag
|
||||
cold Set the cold reboot flag
|
||||
triple Force a triple fault (init)
|
||||
kbd Use the keyboard controller. cold reset (default)
|
||||
acpi Use the ACPI RESET_REG in the FADT. If ACPI is not configured or the
|
||||
ACPI reset does not work, the reboot path attempts the reset using
|
||||
the keyboard controller.
|
||||
efi Use efi reset_system runtime service. If EFI is not configured or the
|
||||
EFI reset does not work, the reboot path attempts the reset using
|
||||
the keyboard controller.
|
||||
|
||||
Using warm reset will be much faster especially on big memory
|
||||
systems because the BIOS will not go through the memory check.
|
||||
Disadvantage is that not all hardware will be completely reinitialized
|
||||
on reboot so there may be boot problems on some systems.
|
||||
|
||||
reboot=force
|
||||
|
||||
Don't stop other CPUs on reboot. This can make reboot more reliable
|
||||
in some cases.
|
||||
|
||||
Non Executable Mappings
|
||||
|
||||
noexec=on|off
|
||||
|
||||
on Enable(default)
|
||||
off Disable
|
||||
|
||||
SMP
|
||||
|
||||
additional_cpus=NUM Allow NUM more CPUs for hotplug
|
||||
(defaults are specified by the BIOS, see Documentation/x86_64/cpu-hotplug-spec)
|
||||
|
||||
NUMA
|
||||
|
||||
numa=off Only set up a single NUMA node spanning all memory.
|
||||
|
||||
numa=noacpi Don't parse the SRAT table for NUMA setup
|
||||
|
||||
numa=fake=CMDLINE
|
||||
If a number, fakes CMDLINE nodes and ignores NUMA setup of the
|
||||
actual machine. Otherwise, system memory is configured
|
||||
depending on the sizes and coefficients listed. For example:
|
||||
numa=fake=2*512,1024,4*256,*128
|
||||
gives two 512M nodes, a 1024M node, four 256M nodes, and the
|
||||
rest split into 128M chunks. If the last character of CMDLINE
|
||||
is a *, the remaining memory is divided up equally among its
|
||||
coefficient:
|
||||
numa=fake=2*512,2*
|
||||
gives two 512M nodes and the rest split into two nodes.
|
||||
Otherwise, the remaining system RAM is allocated to an
|
||||
additional node.
|
||||
|
||||
numa=hotadd=percent
|
||||
Only allow hotadd memory to preallocate page structures upto
|
||||
percent of already available memory.
|
||||
numa=hotadd=0 will disable hotadd memory.
|
||||
|
||||
ACPI
|
||||
|
||||
acpi=off Don't enable ACPI
|
||||
acpi=ht Use ACPI boot table parsing, but don't enable ACPI
|
||||
interpreter
|
||||
acpi=force Force ACPI on (currently not needed)
|
||||
|
||||
acpi=strict Disable out of spec ACPI workarounds.
|
||||
|
||||
acpi_sci={edge,level,high,low} Set up ACPI SCI interrupt.
|
||||
|
||||
acpi=noirq Don't route interrupts
|
||||
|
||||
PCI
|
||||
|
||||
pci=off Don't use PCI
|
||||
pci=conf1 Use conf1 access.
|
||||
pci=conf2 Use conf2 access.
|
||||
pci=rom Assign ROMs.
|
||||
pci=assign-busses Assign busses
|
||||
pci=irqmask=MASK Set PCI interrupt mask to MASK
|
||||
pci=lastbus=NUMBER Scan upto NUMBER busses, no matter what the mptable says.
|
||||
pci=noacpi Don't use ACPI to set up PCI interrupt routing.
|
||||
|
||||
IOMMU (input/output memory management unit)
|
||||
|
||||
Currently four x86-64 PCI-DMA mapping implementations exist:
|
||||
|
||||
1. <arch/x86_64/kernel/pci-nommu.c>: use no hardware/software IOMMU at all
|
||||
(e.g. because you have < 3 GB memory).
|
||||
Kernel boot message: "PCI-DMA: Disabling IOMMU"
|
||||
|
||||
2. <arch/x86_64/kernel/pci-gart.c>: AMD GART based hardware IOMMU.
|
||||
Kernel boot message: "PCI-DMA: using GART IOMMU"
|
||||
|
||||
3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
|
||||
e.g. if there is no hardware IOMMU in the system and it is need because
|
||||
you have >3GB memory or told the kernel to us it (iommu=soft))
|
||||
Kernel boot message: "PCI-DMA: Using software bounce buffering
|
||||
for IO (SWIOTLB)"
|
||||
|
||||
4. <arch/x86_64/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM
|
||||
pSeries and xSeries servers. This hardware IOMMU supports DMA address
|
||||
mapping with memory protection, etc.
|
||||
Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
|
||||
|
||||
iommu=[<size>][,noagp][,off][,force][,noforce][,leak[=<nr_of_leak_pages>]
|
||||
[,memaper[=<order>]][,merge][,forcesac][,fullflush][,nomerge]
|
||||
[,noaperture][,calgary]
|
||||
|
||||
General iommu options:
|
||||
off Don't initialize and use any kind of IOMMU.
|
||||
noforce Don't force hardware IOMMU usage when it is not needed.
|
||||
(default).
|
||||
force Force the use of the hardware IOMMU even when it is
|
||||
not actually needed (e.g. because < 3 GB memory).
|
||||
soft Use software bounce buffering (SWIOTLB) (default for
|
||||
Intel machines). This can be used to prevent the usage
|
||||
of an available hardware IOMMU.
|
||||
|
||||
iommu options only relevant to the AMD GART hardware IOMMU:
|
||||
<size> Set the size of the remapping area in bytes.
|
||||
allowed Overwrite iommu off workarounds for specific chipsets.
|
||||
fullflush Flush IOMMU on each allocation (default).
|
||||
nofullflush Don't use IOMMU fullflush.
|
||||
leak Turn on simple iommu leak tracing (only when
|
||||
CONFIG_IOMMU_LEAK is on). Default number of leak pages
|
||||
is 20.
|
||||
memaper[=<order>] Allocate an own aperture over RAM with size 32MB<<order.
|
||||
(default: order=1, i.e. 64MB)
|
||||
merge Do scatter-gather (SG) merging. Implies "force"
|
||||
(experimental).
|
||||
nomerge Don't do scatter-gather (SG) merging.
|
||||
noaperture Ask the IOMMU not to touch the aperture for AGP.
|
||||
forcesac Force single-address cycle (SAC) mode for masks <40bits
|
||||
(experimental).
|
||||
noagp Don't initialize the AGP driver and use full aperture.
|
||||
allowdac Allow double-address cycle (DAC) mode, i.e. DMA >4GB.
|
||||
DAC is used with 32-bit PCI to push a 64-bit address in
|
||||
two cycles. When off all DMA over >4GB is forced through
|
||||
an IOMMU or software bounce buffering.
|
||||
nodac Forbid DAC mode, i.e. DMA >4GB.
|
||||
panic Always panic when IOMMU overflows.
|
||||
calgary Use the Calgary IOMMU if it is available
|
||||
|
||||
iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
|
||||
implementation:
|
||||
swiotlb=<pages>[,force]
|
||||
<pages> Prereserve that many 128K pages for the software IO
|
||||
bounce buffering.
|
||||
force Force all IO through the software TLB.
|
||||
|
||||
Settings for the IBM Calgary hardware IOMMU currently found in IBM
|
||||
pSeries and xSeries machines:
|
||||
|
||||
calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
|
||||
calgary=[translate_empty_slots]
|
||||
calgary=[disable=<PCI bus number>]
|
||||
panic Always panic when IOMMU overflows
|
||||
|
||||
64k,...,8M - Set the size of each PCI slot's translation table
|
||||
when using the Calgary IOMMU. This is the size of the translation
|
||||
table itself in main memory. The smallest table, 64k, covers an IO
|
||||
space of 32MB; the largest, 8MB table, can cover an IO space of
|
||||
4GB. Normally the kernel will make the right choice by itself.
|
||||
|
||||
translate_empty_slots - Enable translation even on slots that have
|
||||
no devices attached to them, in case a device will be hotplugged
|
||||
in the future.
|
||||
|
||||
disable=<PCI bus number> - Disable translation on a given PHB. For
|
||||
example, the built-in graphics adapter resides on the first bridge
|
||||
(PCI bus number 0); if translation (isolation) is enabled on this
|
||||
bridge, X servers that access the hardware directly from user
|
||||
space might stop working. Use this option if you have devices that
|
||||
are accessed from userspace directly on some PCI host bridge.
|
||||
|
||||
Debugging
|
||||
|
||||
oops=panic Always panic on oopses. Default is to just kill the process,
|
||||
but there is a small probability of deadlocking the machine.
|
||||
This will also cause panics on machine check exceptions.
|
||||
Useful together with panic=30 to trigger a reboot.
|
||||
|
||||
kstack=N Print N words from the kernel stack in oops dumps.
|
||||
|
||||
pagefaulttrace Dump all page faults. Only useful for extreme debugging
|
||||
and will create a lot of output.
|
||||
|
||||
call_trace=[old|both|newfallback|new]
|
||||
old: use old inexact backtracer
|
||||
new: use new exact dwarf2 unwinder
|
||||
both: print entries from both
|
||||
newfallback: use new unwinder but fall back to old if it gets
|
||||
stuck (default)
|
||||
|
||||
Miscellaneous
|
||||
|
||||
nogbpages
|
||||
Do not use GB pages for kernel direct mappings.
|
||||
gbpages
|
||||
Use GB pages for kernel direct mappings.
|
21
Documentation/x86/x86_64/cpu-hotplug-spec
Normal file
21
Documentation/x86/x86_64/cpu-hotplug-spec
Normal file
@@ -0,0 +1,21 @@
|
||||
Firmware support for CPU hotplug under Linux/x86-64
|
||||
---------------------------------------------------
|
||||
|
||||
Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to
|
||||
know in advance of boot time the maximum number of CPUs that could be plugged
|
||||
into the system. ACPI 3.0 currently has no official way to supply
|
||||
this information from the firmware to the operating system.
|
||||
|
||||
In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the
|
||||
ACPI 3.0 specification). ACPI already has the concept of disabled LAPIC
|
||||
objects by setting the Enabled bit in the LAPIC object to zero.
|
||||
|
||||
For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable
|
||||
CPU is already available in the MADT. If the CPU is not available yet
|
||||
it should have its LAPIC Enabled bit set to 0. Linux will use the number
|
||||
of disabled LAPICs to compute the maximum number of future CPUs.
|
||||
|
||||
In the worst case the user can overwrite this choice using a command line
|
||||
option (additional_cpus=...), but it is recommended to supply the correct
|
||||
number (or a reasonable approximation of it, with erring towards more not less)
|
||||
in the MADT to avoid manual configuration.
|
66
Documentation/x86/x86_64/fake-numa-for-cpusets
Normal file
66
Documentation/x86/x86_64/fake-numa-for-cpusets
Normal file
@@ -0,0 +1,66 @@
|
||||
Using numa=fake and CPUSets for Resource Management
|
||||
Written by David Rientjes <rientjes@cs.washington.edu>
|
||||
|
||||
This document describes how the numa=fake x86_64 command-line option can be used
|
||||
in conjunction with cpusets for coarse memory management. Using this feature,
|
||||
you can create fake NUMA nodes that represent contiguous chunks of memory and
|
||||
assign them to cpusets and their attached tasks. This is a way of limiting the
|
||||
amount of system memory that are available to a certain class of tasks.
|
||||
|
||||
For more information on the features of cpusets, see Documentation/cpusets.txt.
|
||||
There are a number of different configurations you can use for your needs. For
|
||||
more information on the numa=fake command line option and its various ways of
|
||||
configuring fake nodes, see Documentation/x86_64/boot-options.txt.
|
||||
|
||||
For the purposes of this introduction, we'll assume a very primitive NUMA
|
||||
emulation setup of "numa=fake=4*512,". This will split our system memory into
|
||||
four equal chunks of 512M each that we can now use to assign to cpusets. As
|
||||
you become more familiar with using this combination for resource control,
|
||||
you'll determine a better setup to minimize the number of nodes you have to deal
|
||||
with.
|
||||
|
||||
A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
|
||||
|
||||
Faking node 0 at 0000000000000000-0000000020000000 (512MB)
|
||||
Faking node 1 at 0000000020000000-0000000040000000 (512MB)
|
||||
Faking node 2 at 0000000040000000-0000000060000000 (512MB)
|
||||
Faking node 3 at 0000000060000000-0000000080000000 (512MB)
|
||||
...
|
||||
On node 0 totalpages: 130975
|
||||
On node 1 totalpages: 131072
|
||||
On node 2 totalpages: 131072
|
||||
On node 3 totalpages: 131072
|
||||
|
||||
Now following the instructions for mounting the cpusets filesystem from
|
||||
Documentation/cpusets.txt, you can assign fake nodes (i.e. contiguous memory
|
||||
address spaces) to individual cpusets:
|
||||
|
||||
[root@xroads /]# mkdir exampleset
|
||||
[root@xroads /]# mount -t cpuset none exampleset
|
||||
[root@xroads /]# mkdir exampleset/ddset
|
||||
[root@xroads /]# cd exampleset/ddset
|
||||
[root@xroads /exampleset/ddset]# echo 0-1 > cpus
|
||||
[root@xroads /exampleset/ddset]# echo 0-1 > mems
|
||||
|
||||
Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for
|
||||
memory allocations (1G).
|
||||
|
||||
You can now assign tasks to these cpusets to limit the memory resources
|
||||
available to them according to the fake nodes assigned as mems:
|
||||
|
||||
[root@xroads /exampleset/ddset]# echo $$ > tasks
|
||||
[root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
|
||||
[1] 13425
|
||||
|
||||
Notice the difference between the system memory usage as reported by
|
||||
/proc/meminfo between the restricted cpuset case above and the unrestricted
|
||||
case (i.e. running the same 'dd' command without assigning it to a fake NUMA
|
||||
cpuset):
|
||||
Unrestricted Restricted
|
||||
MemTotal: 3091900 kB 3091900 kB
|
||||
MemFree: 42113 kB 1513236 kB
|
||||
|
||||
This allows for coarse memory management for the tasks you assign to particular
|
||||
cpusets. Since cpusets can form a hierarchy, you can create some pretty
|
||||
interesting combinations of use-cases for various classes of tasks for your
|
||||
memory management needs.
|
99
Documentation/x86/x86_64/kernel-stacks
Normal file
99
Documentation/x86/x86_64/kernel-stacks
Normal file
@@ -0,0 +1,99 @@
|
||||
Most of the text from Keith Owens, hacked by AK
|
||||
|
||||
x86_64 page size (PAGE_SIZE) is 4K.
|
||||
|
||||
Like all other architectures, x86_64 has a kernel stack for every
|
||||
active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
|
||||
These stacks contain useful data as long as a thread is alive or a
|
||||
zombie. While the thread is in user space the kernel stack is empty
|
||||
except for the thread_info structure at the bottom.
|
||||
|
||||
In addition to the per thread stacks, there are specialized stacks
|
||||
associated with each CPU. These stacks are only used while the kernel
|
||||
is in control on that CPU; when a CPU returns to user space the
|
||||
specialized stacks contain no useful data. The main CPU stacks are:
|
||||
|
||||
* Interrupt stack. IRQSTACKSIZE
|
||||
|
||||
Used for external hardware interrupts. If this is the first external
|
||||
hardware interrupt (i.e. not a nested hardware interrupt) then the
|
||||
kernel switches from the current task to the interrupt stack. Like
|
||||
the split thread and interrupt stacks on i386 (with CONFIG_4KSTACKS),
|
||||
this gives more room for kernel interrupt processing without having
|
||||
to increase the size of every per thread stack.
|
||||
|
||||
The interrupt stack is also used when processing a softirq.
|
||||
|
||||
Switching to the kernel interrupt stack is done by software based on a
|
||||
per CPU interrupt nest counter. This is needed because x86-64 "IST"
|
||||
hardware stacks cannot nest without races.
|
||||
|
||||
x86_64 also has a feature which is not available on i386, the ability
|
||||
to automatically switch to a new stack for designated events such as
|
||||
double fault or NMI, which makes it easier to handle these unusual
|
||||
events on x86_64. This feature is called the Interrupt Stack Table
|
||||
(IST). There can be up to 7 IST entries per CPU. The IST code is an
|
||||
index into the Task State Segment (TSS). The IST entries in the TSS
|
||||
point to dedicated stacks; each stack can be a different size.
|
||||
|
||||
An IST is selected by a non-zero value in the IST field of an
|
||||
interrupt-gate descriptor. When an interrupt occurs and the hardware
|
||||
loads such a descriptor, the hardware automatically sets the new stack
|
||||
pointer based on the IST value, then invokes the interrupt handler. If
|
||||
software wants to allow nested IST interrupts then the handler must
|
||||
adjust the IST values on entry to and exit from the interrupt handler.
|
||||
(This is occasionally done, e.g. for debug exceptions.)
|
||||
|
||||
Events with different IST codes (i.e. with different stacks) can be
|
||||
nested. For example, a debug interrupt can safely be interrupted by an
|
||||
NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
|
||||
pointers on entry to and exit from all IST events, in theory allowing
|
||||
IST events with the same code to be nested. However in most cases, the
|
||||
stack size allocated to an IST assumes no nesting for the same code.
|
||||
If that assumption is ever broken then the stacks will become corrupt.
|
||||
|
||||
The currently assigned IST stacks are :-
|
||||
|
||||
* STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE).
|
||||
|
||||
Used for interrupt 12 - Stack Fault Exception (#SS).
|
||||
|
||||
This allows the CPU to recover from invalid stack segments. Rarely
|
||||
happens.
|
||||
|
||||
* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE).
|
||||
|
||||
Used for interrupt 8 - Double Fault Exception (#DF).
|
||||
|
||||
Invoked when handling one exception causes another exception. Happens
|
||||
when the kernel is very confused (e.g. kernel stack pointer corrupt).
|
||||
Using a separate stack allows the kernel to recover from it well enough
|
||||
in many cases to still output an oops.
|
||||
|
||||
* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE).
|
||||
|
||||
Used for non-maskable interrupts (NMI).
|
||||
|
||||
NMI can be delivered at any time, including when the kernel is in the
|
||||
middle of switching stacks. Using IST for NMI events avoids making
|
||||
assumptions about the previous state of the kernel stack.
|
||||
|
||||
* DEBUG_STACK. DEBUG_STKSZ
|
||||
|
||||
Used for hardware debug interrupts (interrupt 1) and for software
|
||||
debug interrupts (INT3).
|
||||
|
||||
When debugging a kernel, debug interrupts (both hardware and
|
||||
software) can occur at any time. Using IST for these interrupts
|
||||
avoids making assumptions about the previous state of the kernel
|
||||
stack.
|
||||
|
||||
* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE).
|
||||
|
||||
Used for interrupt 18 - Machine Check Exception (#MC).
|
||||
|
||||
MCE can be delivered at any time, including when the kernel is in the
|
||||
middle of switching stacks. Using IST for MCE events avoids making
|
||||
assumptions about the previous state of the kernel stack.
|
||||
|
||||
For more details see the Intel IA32 or AMD AMD64 architecture manuals.
|
77
Documentation/x86/x86_64/machinecheck
Normal file
77
Documentation/x86/x86_64/machinecheck
Normal file
@@ -0,0 +1,77 @@
|
||||
|
||||
Configurable sysfs parameters for the x86-64 machine check code.
|
||||
|
||||
Machine checks report internal hardware error conditions detected
|
||||
by the CPU. Uncorrected errors typically cause a machine check
|
||||
(often with panic), corrected ones cause a machine check log entry.
|
||||
|
||||
Machine checks are organized in banks (normally associated with
|
||||
a hardware subsystem) and subevents in a bank. The exact meaning
|
||||
of the banks and subevent is CPU specific.
|
||||
|
||||
mcelog knows how to decode them.
|
||||
|
||||
When you see the "Machine check errors logged" message in the system
|
||||
log then mcelog should run to collect and decode machine check entries
|
||||
from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
|
||||
|
||||
Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
|
||||
(N = CPU number)
|
||||
|
||||
The directory contains some configurable entries:
|
||||
|
||||
Entries:
|
||||
|
||||
bankNctl
|
||||
(N bank number)
|
||||
64bit Hex bitmask enabling/disabling specific subevents for bank N
|
||||
When a bit in the bitmask is zero then the respective
|
||||
subevent will not be reported.
|
||||
By default all events are enabled.
|
||||
Note that BIOS maintain another mask to disable specific events
|
||||
per bank. This is not visible here
|
||||
|
||||
The following entries appear for each CPU, but they are truly shared
|
||||
between all CPUs.
|
||||
|
||||
check_interval
|
||||
How often to poll for corrected machine check errors, in seconds
|
||||
(Note output is hexademical). Default 5 minutes. When the poller
|
||||
finds MCEs it triggers an exponential speedup (poll more often) on
|
||||
the polling interval. When the poller stops finding MCEs, it
|
||||
triggers an exponential backoff (poll less often) on the polling
|
||||
interval. The check_interval variable is both the initial and
|
||||
maximum polling interval.
|
||||
|
||||
tolerant
|
||||
Tolerance level. When a machine check exception occurs for a non
|
||||
corrected machine check the kernel can take different actions.
|
||||
Since machine check exceptions can happen any time it is sometimes
|
||||
risky for the kernel to kill a process because it defies
|
||||
normal kernel locking rules. The tolerance level configures
|
||||
how hard the kernel tries to recover even at some risk of
|
||||
deadlock. Higher tolerant values trade potentially better uptime
|
||||
with the risk of a crash or even corruption (for tolerant >= 3).
|
||||
|
||||
0: always panic on uncorrected errors, log corrected errors
|
||||
1: panic or SIGBUS on uncorrected errors, log corrected errors
|
||||
2: SIGBUS or log uncorrected errors, log corrected errors
|
||||
3: never panic or SIGBUS, log all errors (for testing only)
|
||||
|
||||
Default: 1
|
||||
|
||||
Note this only makes a difference if the CPU allows recovery
|
||||
from a machine check exception. Current x86 CPUs generally do not.
|
||||
|
||||
trigger
|
||||
Program to run when a machine check event is detected.
|
||||
This is an alternative to running mcelog regularly from cron
|
||||
and allows to detect events faster.
|
||||
|
||||
TBD document entries for AMD threshold interrupt configuration
|
||||
|
||||
For more details about the x86 machine check architecture
|
||||
see the Intel and AMD architecture manuals from their developer websites.
|
||||
|
||||
For more details about the architecture see
|
||||
see http://one.firstfloor.org/~andi/mce.pdf
|
28
Documentation/x86/x86_64/mm.txt
Normal file
28
Documentation/x86/x86_64/mm.txt
Normal file
@@ -0,0 +1,28 @@
|
||||
|
||||
<previous description obsolete, deleted>
|
||||
|
||||
Virtual memory map with 4 level page tables:
|
||||
|
||||
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
|
||||
hole caused by [48:63] sign extension
|
||||
ffff800000000000 - ffff80ffffffffff (=40 bits) guard hole
|
||||
ffff810000000000 - ffffc0ffffffffff (=46 bits) direct mapping of all phys. memory
|
||||
ffffc10000000000 - ffffc1ffffffffff (=40 bits) hole
|
||||
ffffc20000000000 - ffffe1ffffffffff (=45 bits) vmalloc/ioremap space
|
||||
ffffe20000000000 - ffffe2ffffffffff (=40 bits) virtual memory map (1TB)
|
||||
... unused hole ...
|
||||
ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
|
||||
ffffffffa0000000 - fffffffffff00000 (=1536 MB) module mapping space
|
||||
|
||||
The direct mapping covers all memory in the system up to the highest
|
||||
memory address (this means in some cases it can also include PCI memory
|
||||
holes).
|
||||
|
||||
vmalloc space is lazily synchronized into the different PML4 pages of
|
||||
the processes using the page fault handler, with init_level4_pgt as
|
||||
reference.
|
||||
|
||||
Current X86-64 implementations only support 40 bits of address space,
|
||||
but we support up to 46 bits. This expands into MBZ space in the page tables.
|
||||
|
||||
-Andi Kleen, Jul 2004
|
38
Documentation/x86/x86_64/uefi.txt
Normal file
38
Documentation/x86/x86_64/uefi.txt
Normal file
@@ -0,0 +1,38 @@
|
||||
General note on [U]EFI x86_64 support
|
||||
-------------------------------------
|
||||
|
||||
The nomenclature EFI and UEFI are used interchangeably in this document.
|
||||
|
||||
Although the tools below are _not_ needed for building the kernel,
|
||||
the needed bootloader support and associated tools for x86_64 platforms
|
||||
with EFI firmware and specifications are listed below.
|
||||
|
||||
1. UEFI specification: http://www.uefi.org
|
||||
|
||||
2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
|
||||
support. Elilo with x86_64 support can be used.
|
||||
|
||||
3. x86_64 platform with EFI/UEFI firmware.
|
||||
|
||||
Mechanics:
|
||||
---------
|
||||
- Build the kernel with the following configuration.
|
||||
CONFIG_FB_EFI=y
|
||||
CONFIG_FRAMEBUFFER_CONSOLE=y
|
||||
If EFI runtime services are expected, the following configuration should
|
||||
be selected.
|
||||
CONFIG_EFI=y
|
||||
CONFIG_EFI_VARS=y or m # optional
|
||||
- Create a VFAT partition on the disk
|
||||
- Copy the following to the VFAT partition:
|
||||
elilo bootloader with x86_64 support, elilo configuration file,
|
||||
kernel image built in first step and corresponding
|
||||
initrd. Instructions on building elilo and its dependencies
|
||||
can be found in the elilo sourceforge project.
|
||||
- Boot to EFI shell and invoke elilo choosing the kernel image built
|
||||
in first step.
|
||||
- If some or all EFI runtime services don't work, you can try following
|
||||
kernel command line parameters to turn off some or all EFI runtime
|
||||
services.
|
||||
noefi turn off all EFI runtime services
|
||||
reboot_type=k turn off EFI reboot runtime service
|
Reference in New Issue
Block a user