Merge tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull rst conversion of docs from Mauro Carvalho Chehab:
 "As agreed with Jon, I'm sending this big series directly to you, c/c
  him, as this series required a special care, in order to avoid
  conflicts with other trees"

* tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
  docs: kbuild: fix build with pdf and fix some minor issues
  docs: block: fix pdf output
  docs: arm: fix a breakage with pdf output
  docs: don't use nested tables
  docs: gpio: add sysfs interface to the admin-guide
  docs: locking: add it to the main index
  docs: add some directories to the main documentation index
  docs: add SPDX tags to new index files
  docs: add a memory-devices subdir to driver-api
  docs: phy: place documentation under driver-api
  docs: serial: move it to the driver-api
  docs: driver-api: add remaining converted dirs to it
  docs: driver-api: add xilinx driver API documentation
  docs: driver-api: add a series of orphaned documents
  docs: admin-guide: add a series of orphaned documents
  docs: cgroup-v1: add it to the admin-guide book
  docs: aoe: add it to the driver-api book
  docs: add some documentation dirs to the driver-api book
  docs: driver-model: move it to the driver-api book
  docs: lp855x-driver.rst: add it to the driver-api book
  ...
This commit is contained in:
Linus Torvalds
2019-07-16 12:21:41 -07:00
559 changed files with 10467 additions and 7533 deletions

View File

@@ -0,0 +1,81 @@
====================
Kernel driver lp855x
====================
Backlight driver for LP855x ICs
Supported chips:
Texas Instruments LP8550, LP8551, LP8552, LP8553, LP8555, LP8556 and
LP8557
Author: Milo(Woogyom) Kim <milo.kim@ti.com>
Description
-----------
* Brightness control
Brightness can be controlled by the pwm input or the i2c command.
The lp855x driver supports both cases.
* Device attributes
1) bl_ctl_mode
Backlight control mode.
Value: pwm based or register based
2) chip_id
The lp855x chip id.
Value: lp8550/lp8551/lp8552/lp8553/lp8555/lp8556/lp8557
Platform data for lp855x
------------------------
For supporting platform specific data, the lp855x platform data can be used.
* name:
Backlight driver name. If it is not defined, default name is set.
* device_control:
Value of DEVICE CONTROL register.
* initial_brightness:
Initial value of backlight brightness.
* period_ns:
Platform specific PWM period value. unit is nano.
Only valid when brightness is pwm input mode.
* size_program:
Total size of lp855x_rom_data.
* rom_data:
List of new eeprom/eprom registers.
Examples
========
1) lp8552 platform data: i2c register mode with new eeprom data::
#define EEPROM_A5_ADDR 0xA5
#define EEPROM_A5_VAL 0x4f /* EN_VSYNC=0 */
static struct lp855x_rom_data lp8552_eeprom_arr[] = {
{EEPROM_A5_ADDR, EEPROM_A5_VAL},
};
static struct lp855x_platform_data lp8552_pdata = {
.name = "lcd-bl",
.device_control = I2C_CONFIG(LP8552),
.initial_brightness = INITIAL_BRT,
.size_program = ARRAY_SIZE(lp8552_eeprom_arr),
.rom_data = lp8552_eeprom_arr,
};
2) lp8556 platform data: pwm input mode with default rom data::
static struct lp855x_platform_data lp8556_pdata = {
.device_control = PWM_CONFIG(LP8556),
.initial_brightness = INITIAL_BRT,
.period_ns = 1000000,
};

View File

@@ -0,0 +1,62 @@
===================================================================
A driver for a selfmade cheap BT8xx based PCI GPIO-card (bt8xxgpio)
===================================================================
For advanced documentation, see http://www.bu3sch.de/btgpio.php
A generic digital 24-port PCI GPIO card can be built out of an ordinary
Brooktree bt848, bt849, bt878 or bt879 based analog TV tuner card. The
Brooktree chip is used in old analog Hauppauge WinTV PCI cards. You can easily
find them used for low prices on the net.
The bt8xx chip does have 24 digital GPIO ports.
These ports are accessible via 24 pins on the SMD chip package.
How to physically access the GPIO pins
======================================
The are several ways to access these pins. One might unsolder the whole chip
and put it on a custom PCI board, or one might only unsolder each individual
GPIO pin and solder that to some tiny wire. As the chip package really is tiny
there are some advanced soldering skills needed in any case.
The physical pinouts are drawn in the following ASCII art.
The GPIO pins are marked with G00-G23::
G G G G G G G G G G G G G G G G G G
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
---------------------------------------------------------------------------
--| ^ ^ |--
--| pin 86 pin 67 |--
--| |--
--| pin 61 > |-- G18
--| |-- G19
--| |-- G20
--| |-- G21
--| |-- G22
--| pin 56 > |-- G23
--| |--
--| Brooktree 878/879 |--
--| |--
--| |--
--| |--
--| |--
--| |--
--| |--
--| |--
--| |--
--| |--
--| |--
--| |--
--| |--
--| |--
--| O |--
--| |--
---------------------------------------------------------------------------
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
^
This is pin 1

View File

@@ -0,0 +1,156 @@
.. SPDX-License-Identifier: GPL-2.0
================
Kernel Connector
================
Kernel connector - new netlink based userspace <-> kernel space easy
to use communication module.
The Connector driver makes it easy to connect various agents using a
netlink based network. One must register a callback and an identifier.
When the driver receives a special netlink message with the appropriate
identifier, the appropriate callback will be called.
From the userspace point of view it's quite straightforward:
- socket();
- bind();
- send();
- recv();
But if kernelspace wants to use the full power of such connections, the
driver writer must create special sockets, must know about struct sk_buff
handling, etc... The Connector driver allows any kernelspace agents to use
netlink based networking for inter-process communication in a significantly
easier way::
int cn_add_callback(struct cb_id *id, char *name, void (*callback) (struct cn_msg *, struct netlink_skb_parms *));
void cn_netlink_send_multi(struct cn_msg *msg, u16 len, u32 portid, u32 __group, int gfp_mask);
void cn_netlink_send(struct cn_msg *msg, u32 portid, u32 __group, int gfp_mask);
struct cb_id
{
__u32 idx;
__u32 val;
};
idx and val are unique identifiers which must be registered in the
connector.h header for in-kernel usage. `void (*callback) (void *)` is a
callback function which will be called when a message with above idx.val
is received by the connector core. The argument for that function must
be dereferenced to `struct cn_msg *`::
struct cn_msg
{
struct cb_id id;
__u32 seq;
__u32 ack;
__u32 len; /* Length of the following data */
__u8 data[0];
};
Connector interfaces
====================
.. kernel-doc:: include/linux/connector.h
Note:
When registering new callback user, connector core assigns
netlink group to the user which is equal to its id.idx.
Protocol description
====================
The current framework offers a transport layer with fixed headers. The
recommended protocol which uses such a header is as following:
msg->seq and msg->ack are used to determine message genealogy. When
someone sends a message, they use a locally unique sequence and random
acknowledge number. The sequence number may be copied into
nlmsghdr->nlmsg_seq too.
The sequence number is incremented with each message sent.
If you expect a reply to the message, then the sequence number in the
received message MUST be the same as in the original message, and the
acknowledge number MUST be the same + 1.
If we receive a message and its sequence number is not equal to one we
are expecting, then it is a new message. If we receive a message and
its sequence number is the same as one we are expecting, but its
acknowledge is not equal to the sequence number in the original
message + 1, then it is a new message.
Obviously, the protocol header contains the above id.
The connector allows event notification in the following form: kernel
driver or userspace process can ask connector to notify it when
selected ids will be turned on or off (registered or unregistered its
callback). It is done by sending a special command to the connector
driver (it also registers itself with id={-1, -1}).
As example of this usage can be found in the cn_test.c module which
uses the connector to request notification and to send messages.
Reliability
===========
Netlink itself is not a reliable protocol. That means that messages can
be lost due to memory pressure or process' receiving queue overflowed,
so caller is warned that it must be prepared. That is why the struct
cn_msg [main connector's message header] contains u32 seq and u32 ack
fields.
Userspace usage
===============
2.6.14 has a new netlink socket implementation, which by default does not
allow people to send data to netlink groups other than 1.
So, if you wish to use a netlink socket (for example using connector)
with a different group number, the userspace application must subscribe to
that group first. It can be achieved by the following pseudocode::
s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
l_local.nl_family = AF_NETLINK;
l_local.nl_groups = 12345;
l_local.nl_pid = 0;
if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) {
perror("bind");
close(s);
return -1;
}
{
int on = l_local.nl_groups;
setsockopt(s, 270, 1, &on, sizeof(on));
}
Where 270 above is SOL_NETLINK, and 1 is a NETLINK_ADD_MEMBERSHIP socket
option. To drop a multicast subscription, one should call the above socket
option with the NETLINK_DROP_MEMBERSHIP parameter which is defined as 0.
2.6.14 netlink code only allows to select a group which is less or equal to
the maximum group number, which is used at netlink_kernel_create() time.
In case of connector it is CN_NETLINK_USERS + 0xf, so if you want to use
group number 12345, you must increment CN_NETLINK_USERS to that number.
Additional 0xf numbers are allocated to be used by non-in-kernel users.
Due to this limitation, group 0xffffffff does not work now, so one can
not use add/remove connector's group notifications, but as far as I know,
only cn_test.c test module used it.
Some work in netlink area is still being done, so things can be changed in
2.6.15 timeframe, if it will happen, documentation will be updated for that
kernel.
Code samples
============
Sample code for a connector test module and user space can be found
in samples/connector/. To build this code, enable CONFIG_CONNECTOR
and CONFIG_SAMPLES.

View File

@@ -0,0 +1,152 @@
.. SPDX-License-Identifier: GPL-2.0
===============
Console Drivers
===============
The Linux kernel has 2 general types of console drivers. The first type is
assigned by the kernel to all the virtual consoles during the boot process.
This type will be called 'system driver', and only one system driver is allowed
to exist. The system driver is persistent and it can never be unloaded, though
it may become inactive.
The second type has to be explicitly loaded and unloaded. This will be called
'modular driver' by this document. Multiple modular drivers can coexist at
any time with each driver sharing the console with other drivers including
the system driver. However, modular drivers cannot take over the console
that is currently occupied by another modular driver. (Exception: Drivers that
call do_take_over_console() will succeed in the takeover regardless of the type
of driver occupying the consoles.) They can only take over the console that is
occupied by the system driver. In the same token, if the modular driver is
released by the console, the system driver will take over.
Modular drivers, from the programmer's point of view, have to call::
do_take_over_console() - load and bind driver to console layer
give_up_console() - unload driver; it will only work if driver
is fully unbound
In newer kernels, the following are also available::
do_register_con_driver()
do_unregister_con_driver()
If sysfs is enabled, the contents of /sys/class/vtconsole can be
examined. This shows the console backends currently registered by the
system which are named vtcon<n> where <n> is an integer from 0 to 15.
Thus::
ls /sys/class/vtconsole
. .. vtcon0 vtcon1
Each directory in /sys/class/vtconsole has 3 files::
ls /sys/class/vtconsole/vtcon0
. .. bind name uevent
What do these files signify?
1. bind - this is a read/write file. It shows the status of the driver if
read, or acts to bind or unbind the driver to the virtual consoles
when written to. The possible values are:
0
- means the driver is not bound and if echo'ed, commands the driver
to unbind
1
- means the driver is bound and if echo'ed, commands the driver to
bind
2. name - read-only file. Shows the name of the driver in this format::
cat /sys/class/vtconsole/vtcon0/name
(S) VGA+
'(S)' stands for a (S)ystem driver, i.e., it cannot be directly
commanded to bind or unbind
'VGA+' is the name of the driver
cat /sys/class/vtconsole/vtcon1/name
(M) frame buffer device
In this case, '(M)' stands for a (M)odular driver, one that can be
directly commanded to bind or unbind.
3. uevent - ignore this file
When unbinding, the modular driver is detached first, and then the system
driver takes over the consoles vacated by the driver. Binding, on the other
hand, will bind the driver to the consoles that are currently occupied by a
system driver.
NOTE1:
Binding and unbinding must be selected in Kconfig. It's under::
Device Drivers ->
Character devices ->
Support for binding and unbinding console drivers
NOTE2:
If any of the virtual consoles are in KD_GRAPHICS mode, then binding or
unbinding will not succeed. An example of an application that sets the
console to KD_GRAPHICS is X.
How useful is this feature? This is very useful for console driver
developers. By unbinding the driver from the console layer, one can unload the
driver, make changes, recompile, reload and rebind the driver without any need
for rebooting the kernel. For regular users who may want to switch from
framebuffer console to VGA console and vice versa, this feature also makes
this possible. (NOTE NOTE NOTE: Please read fbcon.txt under Documentation/fb
for more details.)
Notes for developers
====================
do_take_over_console() is now broken up into::
do_register_con_driver()
do_bind_con_driver() - private function
give_up_console() is a wrapper to do_unregister_con_driver(), and a driver must
be fully unbound for this call to succeed. con_is_bound() will check if the
driver is bound or not.
Guidelines for console driver writers
=====================================
In order for binding to and unbinding from the console to properly work,
console drivers must follow these guidelines:
1. All drivers, except system drivers, must call either do_register_con_driver()
or do_take_over_console(). do_register_con_driver() will just add the driver
to the console's internal list. It won't take over the
console. do_take_over_console(), as it name implies, will also take over (or
bind to) the console.
2. All resources allocated during con->con_init() must be released in
con->con_deinit().
3. All resources allocated in con->con_startup() must be released when the
driver, which was previously bound, becomes unbound. The console layer
does not have a complementary call to con->con_startup() so it's up to the
driver to check when it's legal to release these resources. Calling
con_is_bound() in con->con_deinit() will help. If the call returned
false(), then it's safe to release the resources. This balance has to be
ensured because con->con_startup() can be called again when a request to
rebind the driver to the console arrives.
4. Upon exit of the driver, ensure that the driver is totally unbound. If the
condition is satisfied, then the driver must call do_unregister_con_driver()
or give_up_console().
5. do_unregister_con_driver() can also be called on conditions which make it
impossible for the driver to service console requests. This can happen
with the framebuffer console that suddenly lost all of its drivers.
The current crop of console drivers should still work correctly, but binding
and unbinding them may cause problems. With minimal fixes, these drivers can
be made to work correctly.
Antonino Daplas <adaplas@pol.net>

View File

@@ -0,0 +1,99 @@
===================================
Dell Systems Management Base Driver
===================================
Overview
========
The Dell Systems Management Base Driver provides a sysfs interface for
systems management software such as Dell OpenManage to perform system
management interrupts and host control actions (system power cycle or
power off after OS shutdown) on certain Dell systems.
Dell OpenManage requires this driver on the following Dell PowerEdge systems:
300, 1300, 1400, 400SC, 500SC, 1500SC, 1550, 600SC, 1600SC, 650, 1655MC,
700, and 750. Other Dell software such as the open source libsmbios project
is expected to make use of this driver, and it may include the use of this
driver on other Dell systems.
The Dell libsmbios project aims towards providing access to as much BIOS
information as possible. See http://linux.dell.com/libsmbios/main/ for
more information about the libsmbios project.
System Management Interrupt
===========================
On some Dell systems, systems management software must access certain
management information via a system management interrupt (SMI). The SMI data
buffer must reside in 32-bit address space, and the physical address of the
buffer is required for the SMI. The driver maintains the memory required for
the SMI and provides a way for the application to generate the SMI.
The driver creates the following sysfs entries for systems management
software to perform these system management interrupts::
/sys/devices/platform/dcdbas/smi_data
/sys/devices/platform/dcdbas/smi_data_buf_phys_addr
/sys/devices/platform/dcdbas/smi_data_buf_size
/sys/devices/platform/dcdbas/smi_request
Systems management software must perform the following steps to execute
a SMI using this driver:
1) Lock smi_data.
2) Write system management command to smi_data.
3) Write "1" to smi_request to generate a calling interface SMI or
"2" to generate a raw SMI.
4) Read system management command response from smi_data.
5) Unlock smi_data.
Host Control Action
===================
Dell OpenManage supports a host control feature that allows the administrator
to perform a power cycle or power off of the system after the OS has finished
shutting down. On some Dell systems, this host control feature requires that
a driver perform a SMI after the OS has finished shutting down.
The driver creates the following sysfs entries for systems management software
to schedule the driver to perform a power cycle or power off host control
action after the system has finished shutting down:
/sys/devices/platform/dcdbas/host_control_action
/sys/devices/platform/dcdbas/host_control_smi_type
/sys/devices/platform/dcdbas/host_control_on_shutdown
Dell OpenManage performs the following steps to execute a power cycle or
power off host control action using this driver:
1) Write host control action to be performed to host_control_action.
2) Write type of SMI that driver needs to perform to host_control_smi_type.
3) Write "1" to host_control_on_shutdown to enable host control action.
4) Initiate OS shutdown.
(Driver will perform host control SMI when it is notified that the OS
has finished shutting down.)
Host Control SMI Type
=====================
The following table shows the value to write to host_control_smi_type to
perform a power cycle or power off host control action:
=================== =====================
PowerEdge System Host Control SMI Type
=================== =====================
300 HC_SMITYPE_TYPE1
1300 HC_SMITYPE_TYPE1
1400 HC_SMITYPE_TYPE2
500SC HC_SMITYPE_TYPE2
1500SC HC_SMITYPE_TYPE2
1550 HC_SMITYPE_TYPE2
600SC HC_SMITYPE_TYPE2
1600SC HC_SMITYPE_TYPE2
650 HC_SMITYPE_TYPE2
1655MC HC_SMITYPE_TYPE2
700 HC_SMITYPE_TYPE3
750 HC_SMITYPE_TYPE3
=================== =====================

View File

@@ -0,0 +1,128 @@
=============================================================
Usage of the new open sourced rbu (Remote BIOS Update) driver
=============================================================
Purpose
=======
Document demonstrating the use of the Dell Remote BIOS Update driver.
for updating BIOS images on Dell servers and desktops.
Scope
=====
This document discusses the functionality of the rbu driver only.
It does not cover the support needed from applications to enable the BIOS to
update itself with the image downloaded in to the memory.
Overview
========
This driver works with Dell OpenManage or Dell Update Packages for updating
the BIOS on Dell servers (starting from servers sold since 1999), desktops
and notebooks (starting from those sold in 2005).
Please go to http://support.dell.com register and you can find info on
OpenManage and Dell Update packages (DUP).
Libsmbios can also be used to update BIOS on Dell systems go to
http://linux.dell.com/libsmbios/ for details.
Dell_RBU driver supports BIOS update using the monolithic image and packetized
image methods. In case of monolithic the driver allocates a contiguous chunk
of physical pages having the BIOS image. In case of packetized the app
using the driver breaks the image in to packets of fixed sizes and the driver
would place each packet in contiguous physical memory. The driver also
maintains a link list of packets for reading them back.
If the dell_rbu driver is unloaded all the allocated memory is freed.
The rbu driver needs to have an application (as mentioned above)which will
inform the BIOS to enable the update in the next system reboot.
The user should not unload the rbu driver after downloading the BIOS image
or updating.
The driver load creates the following directories under the /sys file system::
/sys/class/firmware/dell_rbu/loading
/sys/class/firmware/dell_rbu/data
/sys/devices/platform/dell_rbu/image_type
/sys/devices/platform/dell_rbu/data
/sys/devices/platform/dell_rbu/packet_size
The driver supports two types of update mechanism; monolithic and packetized.
These update mechanism depends upon the BIOS currently running on the system.
Most of the Dell systems support a monolithic update where the BIOS image is
copied to a single contiguous block of physical memory.
In case of packet mechanism the single memory can be broken in smaller chunks
of contiguous memory and the BIOS image is scattered in these packets.
By default the driver uses monolithic memory for the update type. This can be
changed to packets during the driver load time by specifying the load
parameter image_type=packet. This can also be changed later as below::
echo packet > /sys/devices/platform/dell_rbu/image_type
In packet update mode the packet size has to be given before any packets can
be downloaded. It is done as below::
echo XXXX > /sys/devices/platform/dell_rbu/packet_size
In the packet update mechanism, the user needs to create a new file having
packets of data arranged back to back. It can be done as follows
The user creates packets header, gets the chunk of the BIOS image and
places it next to the packetheader; now, the packetheader + BIOS image chunk
added together should match the specified packet_size. This makes one
packet, the user needs to create more such packets out of the entire BIOS
image file and then arrange all these packets back to back in to one single
file.
This file is then copied to /sys/class/firmware/dell_rbu/data.
Once this file gets to the driver, the driver extracts packet_size data from
the file and spreads it across the physical memory in contiguous packet_sized
space.
This method makes sure that all the packets get to the driver in a single operation.
In monolithic update the user simply get the BIOS image (.hdr file) and copies
to the data file as is without any change to the BIOS image itself.
Do the steps below to download the BIOS image.
1) echo 1 > /sys/class/firmware/dell_rbu/loading
2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data
3) echo 0 > /sys/class/firmware/dell_rbu/loading
The /sys/class/firmware/dell_rbu/ entries will remain till the following is
done.
::
echo -1 > /sys/class/firmware/dell_rbu/loading
Until this step is completed the driver cannot be unloaded.
Also echoing either mono, packet or init in to image_type will free up the
memory allocated by the driver.
If a user by accident executes steps 1 and 3 above without executing step 2;
it will make the /sys/class/firmware/dell_rbu/ entries disappear.
The entries can be recreated by doing the following::
echo init > /sys/devices/platform/dell_rbu/image_type
.. note:: echoing init in image_type does not change it original value.
Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to
read back the image downloaded.
.. note::
After updating the BIOS image a user mode application needs to execute
code which sends the BIOS update request to the BIOS. So on the next reboot
the BIOS knows about the new image downloaded and it updates itself.
Also don't unload the rbu driver if the image has to be updated.

View File

@@ -0,0 +1,98 @@
==============
Driver Binding
==============
Driver binding is the process of associating a device with a device
driver that can control it. Bus drivers have typically handled this
because there have been bus-specific structures to represent the
devices and the drivers. With generic device and device driver
structures, most of the binding can take place using common code.
Bus
~~~
The bus type structure contains a list of all devices that are on that bus
type in the system. When device_register is called for a device, it is
inserted into the end of this list. The bus object also contains a
list of all drivers of that bus type. When driver_register is called
for a driver, it is inserted at the end of this list. These are the
two events which trigger driver binding.
device_register
~~~~~~~~~~~~~~~
When a new device is added, the bus's list of drivers is iterated over
to find one that supports it. In order to determine that, the device
ID of the device must match one of the device IDs that the driver
supports. The format and semantics for comparing IDs is bus-specific.
Instead of trying to derive a complex state machine and matching
algorithm, it is up to the bus driver to provide a callback to compare
a device against the IDs of a driver. The bus returns 1 if a match was
found; 0 otherwise.
int match(struct device * dev, struct device_driver * drv);
If a match is found, the device's driver field is set to the driver
and the driver's probe callback is called. This gives the driver a
chance to verify that it really does support the hardware, and that
it's in a working state.
Device Class
~~~~~~~~~~~~
Upon the successful completion of probe, the device is registered with
the class to which it belongs. Device drivers belong to one and only one
class, and that is set in the driver's devclass field.
devclass_add_device is called to enumerate the device within the class
and actually register it with the class, which happens with the
class's register_dev callback.
Driver
~~~~~~
When a driver is attached to a device, the device is inserted into the
driver's list of devices.
sysfs
~~~~~
A symlink is created in the bus's 'devices' directory that points to
the device's directory in the physical hierarchy.
A symlink is created in the driver's 'devices' directory that points
to the device's directory in the physical hierarchy.
A directory for the device is created in the class's directory. A
symlink is created in that directory that points to the device's
physical location in the sysfs tree.
A symlink can be created (though this isn't done yet) in the device's
physical directory to either its class directory, or the class's
top-level directory. One can also be created to point to its driver's
directory also.
driver_register
~~~~~~~~~~~~~~~
The process is almost identical for when a new driver is added.
The bus's list of devices is iterated over to find a match. Devices
that already have a driver are skipped. All the devices are iterated
over, to bind as many devices as possible to the driver.
Removal
~~~~~~~
When a device is removed, the reference count for it will eventually
go to 0. When it does, the remove callback of the driver is called. It
is removed from the driver's list of devices and the reference count
of the driver is decremented. All symlinks between the two are removed.
When a driver is removed, the list of devices that it supports is
iterated over, and the driver's remove callback is called for each
one. The device is removed from that list and the symlinks removed.

View File

@@ -0,0 +1,146 @@
=========
Bus Types
=========
Definition
~~~~~~~~~~
See the kerneldoc for the struct bus_type.
int bus_register(struct bus_type * bus);
Declaration
~~~~~~~~~~~
Each bus type in the kernel (PCI, USB, etc) should declare one static
object of this type. They must initialize the name field, and may
optionally initialize the match callback::
struct bus_type pci_bus_type = {
.name = "pci",
.match = pci_bus_match,
};
The structure should be exported to drivers in a header file:
extern struct bus_type pci_bus_type;
Registration
~~~~~~~~~~~~
When a bus driver is initialized, it calls bus_register. This
initializes the rest of the fields in the bus object and inserts it
into a global list of bus types. Once the bus object is registered,
the fields in it are usable by the bus driver.
Callbacks
~~~~~~~~~
match(): Attaching Drivers to Devices
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The format of device ID structures and the semantics for comparing
them are inherently bus-specific. Drivers typically declare an array
of device IDs of devices they support that reside in a bus-specific
driver structure.
The purpose of the match callback is to give the bus an opportunity to
determine if a particular driver supports a particular device by
comparing the device IDs the driver supports with the device ID of a
particular device, without sacrificing bus-specific functionality or
type-safety.
When a driver is registered with the bus, the bus's list of devices is
iterated over, and the match callback is called for each device that
does not have a driver associated with it.
Device and Driver Lists
~~~~~~~~~~~~~~~~~~~~~~~
The lists of devices and drivers are intended to replace the local
lists that many buses keep. They are lists of struct devices and
struct device_drivers, respectively. Bus drivers are free to use the
lists as they please, but conversion to the bus-specific type may be
necessary.
The LDM core provides helper functions for iterating over each list::
int bus_for_each_dev(struct bus_type * bus, struct device * start,
void * data,
int (*fn)(struct device *, void *));
int bus_for_each_drv(struct bus_type * bus, struct device_driver * start,
void * data, int (*fn)(struct device_driver *, void *));
These helpers iterate over the respective list, and call the callback
for each device or driver in the list. All list accesses are
synchronized by taking the bus's lock (read currently). The reference
count on each object in the list is incremented before the callback is
called; it is decremented after the next object has been obtained. The
lock is not held when calling the callback.
sysfs
~~~~~~~~
There is a top-level directory named 'bus'.
Each bus gets a directory in the bus directory, along with two default
directories::
/sys/bus/pci/
|-- devices
`-- drivers
Drivers registered with the bus get a directory in the bus's drivers
directory::
/sys/bus/pci/
|-- devices
`-- drivers
|-- Intel ICH
|-- Intel ICH Joystick
|-- agpgart
`-- e100
Each device that is discovered on a bus of that type gets a symlink in
the bus's devices directory to the device's directory in the physical
hierarchy::
/sys/bus/pci/
|-- devices
| |-- 00:00.0 -> ../../../root/pci0/00:00.0
| |-- 00:01.0 -> ../../../root/pci0/00:01.0
| `-- 00:02.0 -> ../../../root/pci0/00:02.0
`-- drivers
Exporting Attributes
~~~~~~~~~~~~~~~~~~~~
::
struct bus_attribute {
struct attribute attr;
ssize_t (*show)(struct bus_type *, char * buf);
ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
};
Bus drivers can export attributes using the BUS_ATTR_RW macro that works
similarly to the DEVICE_ATTR_RW macro for devices. For example, a
definition like this::
static BUS_ATTR_RW(debug);
is equivalent to declaring::
static bus_attribute bus_attr_debug;
This can then be used to add and remove the attribute from the bus's
sysfs directory using::
int bus_create_file(struct bus_type *, struct bus_attribute *);
void bus_remove_file(struct bus_type *, struct bus_attribute *);

View File

@@ -0,0 +1,149 @@
==============
Device Classes
==============
Introduction
~~~~~~~~~~~~
A device class describes a type of device, like an audio or network
device. The following device classes have been identified:
<Insert List of Device Classes Here>
Each device class defines a set of semantics and a programming interface
that devices of that class adhere to. Device drivers are the
implementation of that programming interface for a particular device on
a particular bus.
Device classes are agnostic with respect to what bus a device resides
on.
Programming Interface
~~~~~~~~~~~~~~~~~~~~~
The device class structure looks like::
typedef int (*devclass_add)(struct device *);
typedef void (*devclass_remove)(struct device *);
See the kerneldoc for the struct class.
A typical device class definition would look like::
struct device_class input_devclass = {
.name = "input",
.add_device = input_add_device,
.remove_device = input_remove_device,
};
Each device class structure should be exported in a header file so it
can be used by drivers, extensions and interfaces.
Device classes are registered and unregistered with the core using::
int devclass_register(struct device_class * cls);
void devclass_unregister(struct device_class * cls);
Devices
~~~~~~~
As devices are bound to drivers, they are added to the device class
that the driver belongs to. Before the driver model core, this would
typically happen during the driver's probe() callback, once the device
has been initialized. It now happens after the probe() callback
finishes from the core.
The device is enumerated in the class. Each time a device is added to
the class, the class's devnum field is incremented and assigned to the
device. The field is never decremented, so if the device is removed
from the class and re-added, it will receive a different enumerated
value.
The class is allowed to create a class-specific structure for the
device and store it in the device's class_data pointer.
There is no list of devices in the device class. Each driver has a
list of devices that it supports. The device class has a list of
drivers of that particular class. To access all of the devices in the
class, iterate over the device lists of each driver in the class.
Device Drivers
~~~~~~~~~~~~~~
Device drivers are added to device classes when they are registered
with the core. A driver specifies the class it belongs to by setting
the struct device_driver::devclass field.
sysfs directory structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There is a top-level sysfs directory named 'class'.
Each class gets a directory in the class directory, along with two
default subdirectories::
class/
`-- input
|-- devices
`-- drivers
Drivers registered with the class get a symlink in the drivers/ directory
that points to the driver's directory (under its bus directory)::
class/
`-- input
|-- devices
`-- drivers
`-- usb:usb_mouse -> ../../../bus/drivers/usb_mouse/
Each device gets a symlink in the devices/ directory that points to the
device's directory in the physical hierarchy::
class/
`-- input
|-- devices
| `-- 1 -> ../../../root/pci0/00:1f.0/usb_bus/00:1f.2-1:0/
`-- drivers
Exporting Attributes
~~~~~~~~~~~~~~~~~~~~
::
struct devclass_attribute {
struct attribute attr;
ssize_t (*show)(struct device_class *, char * buf, size_t count, loff_t off);
ssize_t (*store)(struct device_class *, const char * buf, size_t count, loff_t off);
};
Class drivers can export attributes using the DEVCLASS_ATTR macro that works
similarly to the DEVICE_ATTR macro for devices. For example, a definition
like this::
static DEVCLASS_ATTR(debug,0644,show_debug,store_debug);
is equivalent to declaring::
static devclass_attribute devclass_attr_debug;
The bus driver can add and remove the attribute from the class's
sysfs directory using::
int devclass_create_file(struct device_class *, struct devclass_attribute *);
void devclass_remove_file(struct device_class *, struct devclass_attribute *);
In the example above, the file will be named 'debug' in placed in the
class's directory in sysfs.
Interfaces
~~~~~~~~~~
There may exist multiple mechanisms for accessing the same device of a
particular class type. Device interfaces describe these mechanisms.
When a device is added to a device class, the core attempts to add it
to every interface that is registered with the device class.

View File

@@ -0,0 +1,116 @@
=============================
Device Driver Design Patterns
=============================
This document describes a few common design patterns found in device drivers.
It is likely that subsystem maintainers will ask driver developers to
conform to these design patterns.
1. State Container
2. container_of()
1. State Container
~~~~~~~~~~~~~~~~~~
While the kernel contains a few device drivers that assume that they will
only be probed() once on a certain system (singletons), it is custom to assume
that the device the driver binds to will appear in several instances. This
means that the probe() function and all callbacks need to be reentrant.
The most common way to achieve this is to use the state container design
pattern. It usually has this form::
struct foo {
spinlock_t lock; /* Example member */
(...)
};
static int foo_probe(...)
{
struct foo *foo;
foo = devm_kzalloc(dev, sizeof(*foo), GFP_KERNEL);
if (!foo)
return -ENOMEM;
spin_lock_init(&foo->lock);
(...)
}
This will create an instance of struct foo in memory every time probe() is
called. This is our state container for this instance of the device driver.
Of course it is then necessary to always pass this instance of the
state around to all functions that need access to the state and its members.
For example, if the driver is registering an interrupt handler, you would
pass around a pointer to struct foo like this::
static irqreturn_t foo_handler(int irq, void *arg)
{
struct foo *foo = arg;
(...)
}
static int foo_probe(...)
{
struct foo *foo;
(...)
ret = request_irq(irq, foo_handler, 0, "foo", foo);
}
This way you always get a pointer back to the correct instance of foo in
your interrupt handler.
2. container_of()
~~~~~~~~~~~~~~~~~
Continuing on the above example we add an offloaded work::
struct foo {
spinlock_t lock;
struct workqueue_struct *wq;
struct work_struct offload;
(...)
};
static void foo_work(struct work_struct *work)
{
struct foo *foo = container_of(work, struct foo, offload);
(...)
}
static irqreturn_t foo_handler(int irq, void *arg)
{
struct foo *foo = arg;
queue_work(foo->wq, &foo->offload);
(...)
}
static int foo_probe(...)
{
struct foo *foo;
foo->wq = create_singlethread_workqueue("foo-wq");
INIT_WORK(&foo->offload, foo_work);
(...)
}
The design pattern is the same for an hrtimer or something similar that will
return a single argument which is a pointer to a struct member in the
callback.
container_of() is a macro defined in <linux/kernel.h>
What container_of() does is to obtain a pointer to the containing struct from
a pointer to a member by a simple subtraction using the offsetof() macro from
standard C, which allows something similar to object oriented behaviours.
Notice that the contained member must not be a pointer, but an actual member
for this to work.
We can see here that we avoid having global pointers to our struct foo *
instance this way, while still keeping the number of parameters passed to the
work function to a single pointer.

View File

@@ -0,0 +1,109 @@
==========================
The Basic Device Structure
==========================
See the kerneldoc for the struct device.
Programming Interface
~~~~~~~~~~~~~~~~~~~~~
The bus driver that discovers the device uses this to register the
device with the core::
int device_register(struct device * dev);
The bus should initialize the following fields:
- parent
- name
- bus_id
- bus
A device is removed from the core when its reference count goes to
0. The reference count can be adjusted using::
struct device * get_device(struct device * dev);
void put_device(struct device * dev);
get_device() will return a pointer to the struct device passed to it
if the reference is not already 0 (if it's in the process of being
removed already).
A driver can access the lock in the device structure using::
void lock_device(struct device * dev);
void unlock_device(struct device * dev);
Attributes
~~~~~~~~~~
::
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device *dev, struct device_attribute *attr,
char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count);
};
Attributes of devices can be exported by a device driver through sysfs.
Please see Documentation/filesystems/sysfs.txt for more information
on how sysfs works.
As explained in Documentation/kobject.txt, device attributes must be
created before the KOBJ_ADD uevent is generated. The only way to realize
that is by defining an attribute group.
Attributes are declared using a macro called DEVICE_ATTR::
#define DEVICE_ATTR(name,mode,show,store)
Example:::
static DEVICE_ATTR(type, 0444, show_type, NULL);
static DEVICE_ATTR(power, 0644, show_power, store_power);
This declares two structures of type struct device_attribute with respective
names 'dev_attr_type' and 'dev_attr_power'. These two attributes can be
organized as follows into a group::
static struct attribute *dev_attrs[] = {
&dev_attr_type.attr,
&dev_attr_power.attr,
NULL,
};
static struct attribute_group dev_attr_group = {
.attrs = dev_attrs,
};
static const struct attribute_group *dev_attr_groups[] = {
&dev_attr_group,
NULL,
};
This array of groups can then be associated with a device by setting the
group pointer in struct device before device_register() is invoked::
dev->groups = dev_attr_groups;
device_register(dev);
The device_register() function will use the 'groups' pointer to create the
device attributes and the device_unregister() function will use this pointer
to remove the device attributes.
Word of warning: While the kernel allows device_create_file() and
device_remove_file() to be called on a device at any time, userspace has
strict expectations on when attributes get created. When a new device is
registered in the kernel, a uevent is generated to notify userspace (like
udev) that a new device is available. If attributes are added after the
device is registered, then userspace won't get notified and userspace will
not know about the new attributes.
This is important for device driver that need to publish additional
attributes for a device at driver probe time. If the device driver simply
calls device_create_file() on the device structure passed to it, then
userspace will never be notified of the new attributes.

View File

@@ -0,0 +1,414 @@
================================
Devres - Managed Device Resource
================================
Tejun Heo <teheo@suse.de>
First draft 10 January 2007
.. contents
1. Intro : Huh? Devres?
2. Devres : Devres in a nutshell
3. Devres Group : Group devres'es and release them together
4. Details : Life time rules, calling context, ...
5. Overhead : How much do we have to pay for this?
6. List of managed interfaces: Currently implemented managed interfaces
1. Intro
--------
devres came up while trying to convert libata to use iomap. Each
iomapped address should be kept and unmapped on driver detach. For
example, a plain SFF ATA controller (that is, good old PCI IDE) in
native mode makes use of 5 PCI BARs and all of them should be
maintained.
As with many other device drivers, libata low level drivers have
sufficient bugs in ->remove and ->probe failure path. Well, yes,
that's probably because libata low level driver developers are lazy
bunch, but aren't all low level driver developers? After spending a
day fiddling with braindamaged hardware with no document or
braindamaged document, if it's finally working, well, it's working.
For one reason or another, low level drivers don't receive as much
attention or testing as core code, and bugs on driver detach or
initialization failure don't happen often enough to be noticeable.
Init failure path is worse because it's much less travelled while
needs to handle multiple entry points.
So, many low level drivers end up leaking resources on driver detach
and having half broken failure path implementation in ->probe() which
would leak resources or even cause oops when failure occurs. iomap
adds more to this mix. So do msi and msix.
2. Devres
---------
devres is basically linked list of arbitrarily sized memory areas
associated with a struct device. Each devres entry is associated with
a release function. A devres can be released in several ways. No
matter what, all devres entries are released on driver detach. On
release, the associated release function is invoked and then the
devres entry is freed.
Managed interface is created for resources commonly used by device
drivers using devres. For example, coherent DMA memory is acquired
using dma_alloc_coherent(). The managed version is called
dmam_alloc_coherent(). It is identical to dma_alloc_coherent() except
for the DMA memory allocated using it is managed and will be
automatically released on driver detach. Implementation looks like
the following::
struct dma_devres {
size_t size;
void *vaddr;
dma_addr_t dma_handle;
};
static void dmam_coherent_release(struct device *dev, void *res)
{
struct dma_devres *this = res;
dma_free_coherent(dev, this->size, this->vaddr, this->dma_handle);
}
dmam_alloc_coherent(dev, size, dma_handle, gfp)
{
struct dma_devres *dr;
void *vaddr;
dr = devres_alloc(dmam_coherent_release, sizeof(*dr), gfp);
...
/* alloc DMA memory as usual */
vaddr = dma_alloc_coherent(...);
...
/* record size, vaddr, dma_handle in dr */
dr->vaddr = vaddr;
...
devres_add(dev, dr);
return vaddr;
}
If a driver uses dmam_alloc_coherent(), the area is guaranteed to be
freed whether initialization fails half-way or the device gets
detached. If most resources are acquired using managed interface, a
driver can have much simpler init and exit code. Init path basically
looks like the following::
my_init_one()
{
struct mydev *d;
d = devm_kzalloc(dev, sizeof(*d), GFP_KERNEL);
if (!d)
return -ENOMEM;
d->ring = dmam_alloc_coherent(...);
if (!d->ring)
return -ENOMEM;
if (check something)
return -EINVAL;
...
return register_to_upper_layer(d);
}
And exit path::
my_remove_one()
{
unregister_from_upper_layer(d);
shutdown_my_hardware();
}
As shown above, low level drivers can be simplified a lot by using
devres. Complexity is shifted from less maintained low level drivers
to better maintained higher layer. Also, as init failure path is
shared with exit path, both can get more testing.
Note though that when converting current calls or assignments to
managed devm_* versions it is up to you to check if internal operations
like allocating memory, have failed. Managed resources pertains to the
freeing of these resources *only* - all other checks needed are still
on you. In some cases this may mean introducing checks that were not
necessary before moving to the managed devm_* calls.
3. Devres group
---------------
Devres entries can be grouped using devres group. When a group is
released, all contained normal devres entries and properly nested
groups are released. One usage is to rollback series of acquired
resources on failure. For example::
if (!devres_open_group(dev, NULL, GFP_KERNEL))
return -ENOMEM;
acquire A;
if (failed)
goto err;
acquire B;
if (failed)
goto err;
...
devres_remove_group(dev, NULL);
return 0;
err:
devres_release_group(dev, NULL);
return err_code;
As resource acquisition failure usually means probe failure, constructs
like above are usually useful in midlayer driver (e.g. libata core
layer) where interface function shouldn't have side effect on failure.
For LLDs, just returning error code suffices in most cases.
Each group is identified by `void *id`. It can either be explicitly
specified by @id argument to devres_open_group() or automatically
created by passing NULL as @id as in the above example. In both
cases, devres_open_group() returns the group's id. The returned id
can be passed to other devres functions to select the target group.
If NULL is given to those functions, the latest open group is
selected.
For example, you can do something like the following::
int my_midlayer_create_something()
{
if (!devres_open_group(dev, my_midlayer_create_something, GFP_KERNEL))
return -ENOMEM;
...
devres_close_group(dev, my_midlayer_create_something);
return 0;
}
void my_midlayer_destroy_something()
{
devres_release_group(dev, my_midlayer_create_something);
}
4. Details
----------
Lifetime of a devres entry begins on devres allocation and finishes
when it is released or destroyed (removed and freed) - no reference
counting.
devres core guarantees atomicity to all basic devres operations and
has support for single-instance devres types (atomic
lookup-and-add-if-not-found). Other than that, synchronizing
concurrent accesses to allocated devres data is caller's
responsibility. This is usually non-issue because bus ops and
resource allocations already do the job.
For an example of single-instance devres type, read pcim_iomap_table()
in lib/devres.c.
All devres interface functions can be called without context if the
right gfp mask is given.
5. Overhead
-----------
Each devres bookkeeping info is allocated together with requested data
area. With debug option turned off, bookkeeping info occupies 16
bytes on 32bit machines and 24 bytes on 64bit (three pointers rounded
up to ull alignment). If singly linked list is used, it can be
reduced to two pointers (8 bytes on 32bit, 16 bytes on 64bit).
Each devres group occupies 8 pointers. It can be reduced to 6 if
singly linked list is used.
Memory space overhead on ahci controller with two ports is between 300
and 400 bytes on 32bit machine after naive conversion (we can
certainly invest a bit more effort into libata core layer).
6. List of managed interfaces
-----------------------------
CLOCK
devm_clk_get()
devm_clk_get_optional()
devm_clk_put()
devm_clk_hw_register()
devm_of_clk_add_hw_provider()
devm_clk_hw_register_clkdev()
DMA
dmaenginem_async_device_register()
dmam_alloc_coherent()
dmam_alloc_attrs()
dmam_free_coherent()
dmam_pool_create()
dmam_pool_destroy()
DRM
devm_drm_dev_init()
GPIO
devm_gpiod_get()
devm_gpiod_get_index()
devm_gpiod_get_index_optional()
devm_gpiod_get_optional()
devm_gpiod_put()
devm_gpiod_unhinge()
devm_gpiochip_add_data()
devm_gpio_request()
devm_gpio_request_one()
devm_gpio_free()
I2C
devm_i2c_new_dummy_device()
IIO
devm_iio_device_alloc()
devm_iio_device_free()
devm_iio_device_register()
devm_iio_device_unregister()
devm_iio_kfifo_allocate()
devm_iio_kfifo_free()
devm_iio_triggered_buffer_setup()
devm_iio_triggered_buffer_cleanup()
devm_iio_trigger_alloc()
devm_iio_trigger_free()
devm_iio_trigger_register()
devm_iio_trigger_unregister()
devm_iio_channel_get()
devm_iio_channel_release()
devm_iio_channel_get_all()
devm_iio_channel_release_all()
INPUT
devm_input_allocate_device()
IO region
devm_release_mem_region()
devm_release_region()
devm_release_resource()
devm_request_mem_region()
devm_request_region()
devm_request_resource()
IOMAP
devm_ioport_map()
devm_ioport_unmap()
devm_ioremap()
devm_ioremap_nocache()
devm_ioremap_wc()
devm_ioremap_resource() : checks resource, requests memory region, ioremaps
devm_iounmap()
pcim_iomap()
pcim_iomap_regions() : do request_region() and iomap() on multiple BARs
pcim_iomap_table() : array of mapped addresses indexed by BAR
pcim_iounmap()
IRQ
devm_free_irq()
devm_request_any_context_irq()
devm_request_irq()
devm_request_threaded_irq()
devm_irq_alloc_descs()
devm_irq_alloc_desc()
devm_irq_alloc_desc_at()
devm_irq_alloc_desc_from()
devm_irq_alloc_descs_from()
devm_irq_alloc_generic_chip()
devm_irq_setup_generic_chip()
devm_irq_sim_init()
LED
devm_led_classdev_register()
devm_led_classdev_unregister()
MDIO
devm_mdiobus_alloc()
devm_mdiobus_alloc_size()
devm_mdiobus_free()
MEM
devm_free_pages()
devm_get_free_pages()
devm_kasprintf()
devm_kcalloc()
devm_kfree()
devm_kmalloc()
devm_kmalloc_array()
devm_kmemdup()
devm_kstrdup()
devm_kvasprintf()
devm_kzalloc()
MFD
devm_mfd_add_devices()
MUX
devm_mux_chip_alloc()
devm_mux_chip_register()
devm_mux_control_get()
PER-CPU MEM
devm_alloc_percpu()
devm_free_percpu()
PCI
devm_pci_alloc_host_bridge() : managed PCI host bridge allocation
devm_pci_remap_cfgspace() : ioremap PCI configuration space
devm_pci_remap_cfg_resource() : ioremap PCI configuration space resource
pcim_enable_device() : after success, all PCI ops become managed
pcim_pin_device() : keep PCI device enabled after release
PHY
devm_usb_get_phy()
devm_usb_put_phy()
PINCTRL
devm_pinctrl_get()
devm_pinctrl_put()
devm_pinctrl_register()
devm_pinctrl_unregister()
POWER
devm_reboot_mode_register()
devm_reboot_mode_unregister()
PWM
devm_pwm_get()
devm_pwm_put()
REGULATOR
devm_regulator_bulk_get()
devm_regulator_get()
devm_regulator_put()
devm_regulator_register()
RESET
devm_reset_control_get()
devm_reset_controller_register()
SERDEV
devm_serdev_device_open()
SLAVE DMA ENGINE
devm_acpi_dma_controller_register()
SPI
devm_spi_register_master()
WATCHDOG
devm_watchdog_register_device()

View File

@@ -0,0 +1,223 @@
==============
Device Drivers
==============
See the kerneldoc for the struct device_driver.
Allocation
~~~~~~~~~~
Device drivers are statically allocated structures. Though there may
be multiple devices in a system that a driver supports, struct
device_driver represents the driver as a whole (not a particular
device instance).
Initialization
~~~~~~~~~~~~~~
The driver must initialize at least the name and bus fields. It should
also initialize the devclass field (when it arrives), so it may obtain
the proper linkage internally. It should also initialize as many of
the callbacks as possible, though each is optional.
Declaration
~~~~~~~~~~~
As stated above, struct device_driver objects are statically
allocated. Below is an example declaration of the eepro100
driver. This declaration is hypothetical only; it relies on the driver
being converted completely to the new model::
static struct device_driver eepro100_driver = {
.name = "eepro100",
.bus = &pci_bus_type,
.probe = eepro100_probe,
.remove = eepro100_remove,
.suspend = eepro100_suspend,
.resume = eepro100_resume,
};
Most drivers will not be able to be converted completely to the new
model because the bus they belong to has a bus-specific structure with
bus-specific fields that cannot be generalized.
The most common example of this are device ID structures. A driver
typically defines an array of device IDs that it supports. The format
of these structures and the semantics for comparing device IDs are
completely bus-specific. Defining them as bus-specific entities would
sacrifice type-safety, so we keep bus-specific structures around.
Bus-specific drivers should include a generic struct device_driver in
the definition of the bus-specific driver. Like this::
struct pci_driver {
const struct pci_device_id *id_table;
struct device_driver driver;
};
A definition that included bus-specific fields would look like
(using the eepro100 driver again)::
static struct pci_driver eepro100_driver = {
.id_table = eepro100_pci_tbl,
.driver = {
.name = "eepro100",
.bus = &pci_bus_type,
.probe = eepro100_probe,
.remove = eepro100_remove,
.suspend = eepro100_suspend,
.resume = eepro100_resume,
},
};
Some may find the syntax of embedded struct initialization awkward or
even a bit ugly. So far, it's the best way we've found to do what we want...
Registration
~~~~~~~~~~~~
::
int driver_register(struct device_driver *drv);
The driver registers the structure on startup. For drivers that have
no bus-specific fields (i.e. don't have a bus-specific driver
structure), they would use driver_register and pass a pointer to their
struct device_driver object.
Most drivers, however, will have a bus-specific structure and will
need to register with the bus using something like pci_driver_register.
It is important that drivers register their driver structure as early as
possible. Registration with the core initializes several fields in the
struct device_driver object, including the reference count and the
lock. These fields are assumed to be valid at all times and may be
used by the device model core or the bus driver.
Transition Bus Drivers
~~~~~~~~~~~~~~~~~~~~~~
By defining wrapper functions, the transition to the new model can be
made easier. Drivers can ignore the generic structure altogether and
let the bus wrapper fill in the fields. For the callbacks, the bus can
define generic callbacks that forward the call to the bus-specific
callbacks of the drivers.
This solution is intended to be only temporary. In order to get class
information in the driver, the drivers must be modified anyway. Since
converting drivers to the new model should reduce some infrastructural
complexity and code size, it is recommended that they are converted as
class information is added.
Access
~~~~~~
Once the object has been registered, it may access the common fields of
the object, like the lock and the list of devices::
int driver_for_each_dev(struct device_driver *drv, void *data,
int (*callback)(struct device *dev, void *data));
The devices field is a list of all the devices that have been bound to
the driver. The LDM core provides a helper function to operate on all
the devices a driver controls. This helper locks the driver on each
node access, and does proper reference counting on each device as it
accesses it.
sysfs
~~~~~
When a driver is registered, a sysfs directory is created in its
bus's directory. In this directory, the driver can export an interface
to userspace to control operation of the driver on a global basis;
e.g. toggling debugging output in the driver.
A future feature of this directory will be a 'devices' directory. This
directory will contain symlinks to the directories of devices it
supports.
Callbacks
~~~~~~~~~
::
int (*probe) (struct device *dev);
The probe() entry is called in task context, with the bus's rwsem locked
and the driver partially bound to the device. Drivers commonly use
container_of() to convert "dev" to a bus-specific type, both in probe()
and other routines. That type often provides device resource data, such
as pci_dev.resource[] or platform_device.resources, which is used in
addition to dev->platform_data to initialize the driver.
This callback holds the driver-specific logic to bind the driver to a
given device. That includes verifying that the device is present, that
it's a version the driver can handle, that driver data structures can
be allocated and initialized, and that any hardware can be initialized.
Drivers often store a pointer to their state with dev_set_drvdata().
When the driver has successfully bound itself to that device, then probe()
returns zero and the driver model code will finish its part of binding
the driver to that device.
A driver's probe() may return a negative errno value to indicate that
the driver did not bind to this device, in which case it should have
released all resources it allocated::
int (*remove) (struct device *dev);
remove is called to unbind a driver from a device. This may be
called if a device is physically removed from the system, if the
driver module is being unloaded, during a reboot sequence, or
in other cases.
It is up to the driver to determine if the device is present or
not. It should free any resources allocated specifically for the
device; i.e. anything in the device's driver_data field.
If the device is still present, it should quiesce the device and place
it into a supported low-power state::
int (*suspend) (struct device *dev, pm_message_t state);
suspend is called to put the device in a low power state::
int (*resume) (struct device *dev);
Resume is used to bring a device back from a low power state.
Attributes
~~~~~~~~~~
::
struct driver_attribute {
struct attribute attr;
ssize_t (*show)(struct device_driver *driver, char *buf);
ssize_t (*store)(struct device_driver *, const char *buf, size_t count);
};
Device drivers can export attributes via their sysfs directories.
Drivers can declare attributes using a DRIVER_ATTR_RW and DRIVER_ATTR_RO
macro that works identically to the DEVICE_ATTR_RW and DEVICE_ATTR_RO
macros.
Example::
DRIVER_ATTR_RW(debug);
This is equivalent to declaring::
struct driver_attribute driver_attr_debug;
This can then be used to add and remove the attribute from the
driver's directory using::
int driver_create_file(struct device_driver *, const struct driver_attribute *);
void driver_remove_file(struct device_driver *, const struct driver_attribute *);

View File

@@ -0,0 +1,24 @@
============
Driver Model
============
.. toctree::
:maxdepth: 1
binding
bus
class
design-patterns
device
devres
driver
overview
platform
porting
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

View File

@@ -0,0 +1,124 @@
=============================
The Linux Kernel Device Model
=============================
Patrick Mochel <mochel@digitalimplant.org>
Drafted 26 August 2002
Updated 31 January 2006
Overview
~~~~~~~~
The Linux Kernel Driver Model is a unification of all the disparate driver
models that were previously used in the kernel. It is intended to augment the
bus-specific drivers for bridges and devices by consolidating a set of data
and operations into globally accessible data structures.
Traditional driver models implemented some sort of tree-like structure
(sometimes just a list) for the devices they control. There wasn't any
uniformity across the different bus types.
The current driver model provides a common, uniform data model for describing
a bus and the devices that can appear under the bus. The unified bus
model includes a set of common attributes which all busses carry, and a set
of common callbacks, such as device discovery during bus probing, bus
shutdown, bus power management, etc.
The common device and bridge interface reflects the goals of the modern
computer: namely the ability to do seamless device "plug and play", power
management, and hot plug. In particular, the model dictated by Intel and
Microsoft (namely ACPI) ensures that almost every device on almost any bus
on an x86-compatible system can work within this paradigm. Of course,
not every bus is able to support all such operations, although most
buses support most of those operations.
Downstream Access
~~~~~~~~~~~~~~~~~
Common data fields have been moved out of individual bus layers into a common
data structure. These fields must still be accessed by the bus layers,
and sometimes by the device-specific drivers.
Other bus layers are encouraged to do what has been done for the PCI layer.
struct pci_dev now looks like this::
struct pci_dev {
...
struct device dev; /* Generic device interface */
...
};
Note first that the struct device dev within the struct pci_dev is
statically allocated. This means only one allocation on device discovery.
Note also that that struct device dev is not necessarily defined at the
front of the pci_dev structure. This is to make people think about what
they're doing when switching between the bus driver and the global driver,
and to discourage meaningless and incorrect casts between the two.
The PCI bus layer freely accesses the fields of struct device. It knows about
the structure of struct pci_dev, and it should know the structure of struct
device. Individual PCI device drivers that have been converted to the current
driver model generally do not and should not touch the fields of struct device,
unless there is a compelling reason to do so.
The above abstraction prevents unnecessary pain during transitional phases.
If it were not done this way, then when a field was renamed or removed, every
downstream driver would break. On the other hand, if only the bus layer
(and not the device layer) accesses the struct device, it is only the bus
layer that needs to change.
User Interface
~~~~~~~~~~~~~~
By virtue of having a complete hierarchical view of all the devices in the
system, exporting a complete hierarchical view to userspace becomes relatively
easy. This has been accomplished by implementing a special purpose virtual
file system named sysfs.
Almost all mainstream Linux distros mount this filesystem automatically; you
can see some variation of the following in the output of the "mount" command::
$ mount
...
none on /sys type sysfs (rw,noexec,nosuid,nodev)
...
$
The auto-mounting of sysfs is typically accomplished by an entry similar to
the following in the /etc/fstab file::
none /sys sysfs defaults 0 0
or something similar in the /lib/init/fstab file on Debian-based systems::
none /sys sysfs nodev,noexec,nosuid 0 0
If sysfs is not automatically mounted, you can always do it manually with::
# mount -t sysfs sysfs /sys
Whenever a device is inserted into the tree, a directory is created for it.
This directory may be populated at each layer of discovery - the global layer,
the bus layer, or the device layer.
The global layer currently creates two files - 'name' and 'power'. The
former only reports the name of the device. The latter reports the
current power state of the device. It will also be used to set the current
power state.
The bus layer may also create files for the devices it finds while probing the
bus. For example, the PCI layer currently creates 'irq' and 'resource' files
for each PCI device.
A device-specific driver may also export files in its directory to expose
device-specific data or tunable interfaces.
More information about the sysfs directory layout can be found in
the other documents in this directory and in the file
Documentation/filesystems/sysfs.txt.

View File

@@ -0,0 +1,246 @@
============================
Platform Devices and Drivers
============================
See <linux/platform_device.h> for the driver model interface to the
platform bus: platform_device, and platform_driver. This pseudo-bus
is used to connect devices on busses with minimal infrastructure,
like those used to integrate peripherals on many system-on-chip
processors, or some "legacy" PC interconnects; as opposed to large
formally specified ones like PCI or USB.
Platform devices
~~~~~~~~~~~~~~~~
Platform devices are devices that typically appear as autonomous
entities in the system. This includes legacy port-based devices and
host bridges to peripheral buses, and most controllers integrated
into system-on-chip platforms. What they usually have in common
is direct addressing from a CPU bus. Rarely, a platform_device will
be connected through a segment of some other kind of bus; but its
registers will still be directly addressable.
Platform devices are given a name, used in driver binding, and a
list of resources such as addresses and IRQs::
struct platform_device {
const char *name;
u32 id;
struct device dev;
u32 num_resources;
struct resource *resource;
};
Platform drivers
~~~~~~~~~~~~~~~~
Platform drivers follow the standard driver model convention, where
discovery/enumeration is handled outside the drivers, and drivers
provide probe() and remove() methods. They support power management
and shutdown notifications using the standard conventions::
struct platform_driver {
int (*probe)(struct platform_device *);
int (*remove)(struct platform_device *);
void (*shutdown)(struct platform_device *);
int (*suspend)(struct platform_device *, pm_message_t state);
int (*suspend_late)(struct platform_device *, pm_message_t state);
int (*resume_early)(struct platform_device *);
int (*resume)(struct platform_device *);
struct device_driver driver;
};
Note that probe() should in general verify that the specified device hardware
actually exists; sometimes platform setup code can't be sure. The probing
can use device resources, including clocks, and device platform_data.
Platform drivers register themselves the normal way::
int platform_driver_register(struct platform_driver *drv);
Or, in common situations where the device is known not to be hot-pluggable,
the probe() routine can live in an init section to reduce the driver's
runtime memory footprint::
int platform_driver_probe(struct platform_driver *drv,
int (*probe)(struct platform_device *))
Kernel modules can be composed of several platform drivers. The platform core
provides helpers to register and unregister an array of drivers::
int __platform_register_drivers(struct platform_driver * const *drivers,
unsigned int count, struct module *owner);
void platform_unregister_drivers(struct platform_driver * const *drivers,
unsigned int count);
If one of the drivers fails to register, all drivers registered up to that
point will be unregistered in reverse order. Note that there is a convenience
macro that passes THIS_MODULE as owner parameter::
#define platform_register_drivers(drivers, count)
Device Enumeration
~~~~~~~~~~~~~~~~~~
As a rule, platform specific (and often board-specific) setup code will
register platform devices::
int platform_device_register(struct platform_device *pdev);
int platform_add_devices(struct platform_device **pdevs, int ndev);
The general rule is to register only those devices that actually exist,
but in some cases extra devices might be registered. For example, a kernel
might be configured to work with an external network adapter that might not
be populated on all boards, or likewise to work with an integrated controller
that some boards might not hook up to any peripherals.
In some cases, boot firmware will export tables describing the devices
that are populated on a given board. Without such tables, often the
only way for system setup code to set up the correct devices is to build
a kernel for a specific target board. Such board-specific kernels are
common with embedded and custom systems development.
In many cases, the memory and IRQ resources associated with the platform
device are not enough to let the device's driver work. Board setup code
will often provide additional information using the device's platform_data
field to hold additional information.
Embedded systems frequently need one or more clocks for platform devices,
which are normally kept off until they're actively needed (to save power).
System setup also associates those clocks with the device, so that that
calls to clk_get(&pdev->dev, clock_name) return them as needed.
Legacy Drivers: Device Probing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some drivers are not fully converted to the driver model, because they take
on a non-driver role: the driver registers its platform device, rather than
leaving that for system infrastructure. Such drivers can't be hotplugged
or coldplugged, since those mechanisms require device creation to be in a
different system component than the driver.
The only "good" reason for this is to handle older system designs which, like
original IBM PCs, rely on error-prone "probe-the-hardware" models for hardware
configuration. Newer systems have largely abandoned that model, in favor of
bus-level support for dynamic configuration (PCI, USB), or device tables
provided by the boot firmware (e.g. PNPACPI on x86). There are too many
conflicting options about what might be where, and even educated guesses by
an operating system will be wrong often enough to make trouble.
This style of driver is discouraged. If you're updating such a driver,
please try to move the device enumeration to a more appropriate location,
outside the driver. This will usually be cleanup, since such drivers
tend to already have "normal" modes, such as ones using device nodes that
were created by PNP or by platform device setup.
None the less, there are some APIs to support such legacy drivers. Avoid
using these calls except with such hotplug-deficient drivers::
struct platform_device *platform_device_alloc(
const char *name, int id);
You can use platform_device_alloc() to dynamically allocate a device, which
you will then initialize with resources and platform_device_register().
A better solution is usually::
struct platform_device *platform_device_register_simple(
const char *name, int id,
struct resource *res, unsigned int nres);
You can use platform_device_register_simple() as a one-step call to allocate
and register a device.
Device Naming and Driver Binding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The platform_device.dev.bus_id is the canonical name for the devices.
It's built from two components:
* platform_device.name ... which is also used to for driver matching.
* platform_device.id ... the device instance number, or else "-1"
to indicate there's only one.
These are concatenated, so name/id "serial"/0 indicates bus_id "serial.0", and
"serial/3" indicates bus_id "serial.3"; both would use the platform_driver
named "serial". While "my_rtc"/-1 would be bus_id "my_rtc" (no instance id)
and use the platform_driver called "my_rtc".
Driver binding is performed automatically by the driver core, invoking
driver probe() after finding a match between device and driver. If the
probe() succeeds, the driver and device are bound as usual. There are
three different ways to find such a match:
- Whenever a device is registered, the drivers for that bus are
checked for matches. Platform devices should be registered very
early during system boot.
- When a driver is registered using platform_driver_register(), all
unbound devices on that bus are checked for matches. Drivers
usually register later during booting, or by module loading.
- Registering a driver using platform_driver_probe() works just like
using platform_driver_register(), except that the driver won't
be probed later if another device registers. (Which is OK, since
this interface is only for use with non-hotpluggable devices.)
Early Platform Devices and Drivers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The early platform interfaces provide platform data to platform device
drivers early on during the system boot. The code is built on top of the
early_param() command line parsing and can be executed very early on.
Example: "earlyprintk" class early serial console in 6 steps
1. Registering early platform device data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The architecture code registers platform device data using the function
early_platform_add_devices(). In the case of early serial console this
should be hardware configuration for the serial port. Devices registered
at this point will later on be matched against early platform drivers.
2. Parsing kernel command line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The architecture code calls parse_early_param() to parse the kernel
command line. This will execute all matching early_param() callbacks.
User specified early platform devices will be registered at this point.
For the early serial console case the user can specify port on the
kernel command line as "earlyprintk=serial.0" where "earlyprintk" is
the class string, "serial" is the name of the platform driver and
0 is the platform device id. If the id is -1 then the dot and the
id can be omitted.
3. Installing early platform drivers belonging to a certain class
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The architecture code may optionally force registration of all early
platform drivers belonging to a certain class using the function
early_platform_driver_register_all(). User specified devices from
step 2 have priority over these. This step is omitted by the serial
driver example since the early serial driver code should be disabled
unless the user has specified port on the kernel command line.
4. Early platform driver registration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compiled-in platform drivers making use of early_platform_init() are
automatically registered during step 2 or 3. The serial driver example
should use early_platform_init("earlyprintk", &platform_driver).
5. Probing of early platform drivers belonging to a certain class
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The architecture code calls early_platform_driver_probe() to match
registered early platform devices associated with a certain class with
registered early platform drivers. Matched devices will get probed().
This step can be executed at any point during the early boot. As soon
as possible may be good for the serial port case.
6. Inside the early platform driver probe()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The driver code needs to take special care during early boot, especially
when it comes to memory allocation and interrupt registration. The code
in the probe() function can use is_early_platform_device() to check if
it is called at early platform device or at the regular platform device
time. The early serial driver performs register_console() at this point.
For further information, see <linux/platform_device.h>.

View File

@@ -0,0 +1,448 @@
=======================================
Porting Drivers to the New Driver Model
=======================================
Patrick Mochel
7 January 2003
Overview
Please refer to `Documentation/driver-api/driver-model/*.rst` for definitions of
various driver types and concepts.
Most of the work of porting devices drivers to the new model happens
at the bus driver layer. This was intentional, to minimize the
negative effect on kernel drivers, and to allow a gradual transition
of bus drivers.
In a nutshell, the driver model consists of a set of objects that can
be embedded in larger, bus-specific objects. Fields in these generic
objects can replace fields in the bus-specific objects.
The generic objects must be registered with the driver model core. By
doing so, they will exported via the sysfs filesystem. sysfs can be
mounted by doing::
# mount -t sysfs sysfs /sys
The Process
Step 0: Read include/linux/device.h for object and function definitions.
Step 1: Registering the bus driver.
- Define a struct bus_type for the bus driver::
struct bus_type pci_bus_type = {
.name = "pci",
};
- Register the bus type.
This should be done in the initialization function for the bus type,
which is usually the module_init(), or equivalent, function::
static int __init pci_driver_init(void)
{
return bus_register(&pci_bus_type);
}
subsys_initcall(pci_driver_init);
The bus type may be unregistered (if the bus driver may be compiled
as a module) by doing::
bus_unregister(&pci_bus_type);
- Export the bus type for others to use.
Other code may wish to reference the bus type, so declare it in a
shared header file and export the symbol.
From include/linux/pci.h::
extern struct bus_type pci_bus_type;
From file the above code appears in::
EXPORT_SYMBOL(pci_bus_type);
- This will cause the bus to show up in /sys/bus/pci/ with two
subdirectories: 'devices' and 'drivers'::
# tree -d /sys/bus/pci/
/sys/bus/pci/
|-- devices
`-- drivers
Step 2: Registering Devices.
struct device represents a single device. It mainly contains metadata
describing the relationship the device has to other entities.
- Embed a struct device in the bus-specific device type::
struct pci_dev {
...
struct device dev; /* Generic device interface */
...
};
It is recommended that the generic device not be the first item in
the struct to discourage programmers from doing mindless casts
between the object types. Instead macros, or inline functions,
should be created to convert from the generic object type::
#define to_pci_dev(n) container_of(n, struct pci_dev, dev)
or
static inline struct pci_dev * to_pci_dev(struct kobject * kobj)
{
return container_of(n, struct pci_dev, dev);
}
This allows the compiler to verify type-safety of the operations
that are performed (which is Good).
- Initialize the device on registration.
When devices are discovered or registered with the bus type, the
bus driver should initialize the generic device. The most important
things to initialize are the bus_id, parent, and bus fields.
The bus_id is an ASCII string that contains the device's address on
the bus. The format of this string is bus-specific. This is
necessary for representing devices in sysfs.
parent is the physical parent of the device. It is important that
the bus driver sets this field correctly.
The driver model maintains an ordered list of devices that it uses
for power management. This list must be in order to guarantee that
devices are shutdown before their physical parents, and vice versa.
The order of this list is determined by the parent of registered
devices.
Also, the location of the device's sysfs directory depends on a
device's parent. sysfs exports a directory structure that mirrors
the device hierarchy. Accurately setting the parent guarantees that
sysfs will accurately represent the hierarchy.
The device's bus field is a pointer to the bus type the device
belongs to. This should be set to the bus_type that was declared
and initialized before.
Optionally, the bus driver may set the device's name and release
fields.
The name field is an ASCII string describing the device, like
"ATI Technologies Inc Radeon QD"
The release field is a callback that the driver model core calls
when the device has been removed, and all references to it have
been released. More on this in a moment.
- Register the device.
Once the generic device has been initialized, it can be registered
with the driver model core by doing::
device_register(&dev->dev);
It can later be unregistered by doing::
device_unregister(&dev->dev);
This should happen on buses that support hotpluggable devices.
If a bus driver unregisters a device, it should not immediately free
it. It should instead wait for the driver model core to call the
device's release method, then free the bus-specific object.
(There may be other code that is currently referencing the device
structure, and it would be rude to free the device while that is
happening).
When the device is registered, a directory in sysfs is created.
The PCI tree in sysfs looks like::
/sys/devices/pci0/
|-- 00:00.0
|-- 00:01.0
| `-- 01:00.0
|-- 00:02.0
| `-- 02:1f.0
| `-- 03:00.0
|-- 00:1e.0
| `-- 04:04.0
|-- 00:1f.0
|-- 00:1f.1
| |-- ide0
| | |-- 0.0
| | `-- 0.1
| `-- ide1
| `-- 1.0
|-- 00:1f.2
|-- 00:1f.3
`-- 00:1f.5
Also, symlinks are created in the bus's 'devices' directory
that point to the device's directory in the physical hierarchy::
/sys/bus/pci/devices/
|-- 00:00.0 -> ../../../devices/pci0/00:00.0
|-- 00:01.0 -> ../../../devices/pci0/00:01.0
|-- 00:02.0 -> ../../../devices/pci0/00:02.0
|-- 00:1e.0 -> ../../../devices/pci0/00:1e.0
|-- 00:1f.0 -> ../../../devices/pci0/00:1f.0
|-- 00:1f.1 -> ../../../devices/pci0/00:1f.1
|-- 00:1f.2 -> ../../../devices/pci0/00:1f.2
|-- 00:1f.3 -> ../../../devices/pci0/00:1f.3
|-- 00:1f.5 -> ../../../devices/pci0/00:1f.5
|-- 01:00.0 -> ../../../devices/pci0/00:01.0/01:00.0
|-- 02:1f.0 -> ../../../devices/pci0/00:02.0/02:1f.0
|-- 03:00.0 -> ../../../devices/pci0/00:02.0/02:1f.0/03:00.0
`-- 04:04.0 -> ../../../devices/pci0/00:1e.0/04:04.0
Step 3: Registering Drivers.
struct device_driver is a simple driver structure that contains a set
of operations that the driver model core may call.
- Embed a struct device_driver in the bus-specific driver.
Just like with devices, do something like::
struct pci_driver {
...
struct device_driver driver;
};
- Initialize the generic driver structure.
When the driver registers with the bus (e.g. doing pci_register_driver()),
initialize the necessary fields of the driver: the name and bus
fields.
- Register the driver.
After the generic driver has been initialized, call::
driver_register(&drv->driver);
to register the driver with the core.
When the driver is unregistered from the bus, unregister it from the
core by doing::
driver_unregister(&drv->driver);
Note that this will block until all references to the driver have
gone away. Normally, there will not be any.
- Sysfs representation.
Drivers are exported via sysfs in their bus's 'driver's directory.
For example::
/sys/bus/pci/drivers/
|-- 3c59x
|-- Ensoniq AudioPCI
|-- agpgart-amdk7
|-- e100
`-- serial
Step 4: Define Generic Methods for Drivers.
struct device_driver defines a set of operations that the driver model
core calls. Most of these operations are probably similar to
operations the bus already defines for drivers, but taking different
parameters.
It would be difficult and tedious to force every driver on a bus to
simultaneously convert their drivers to generic format. Instead, the
bus driver should define single instances of the generic methods that
forward call to the bus-specific drivers. For instance::
static int pci_device_remove(struct device * dev)
{
struct pci_dev * pci_dev = to_pci_dev(dev);
struct pci_driver * drv = pci_dev->driver;
if (drv) {
if (drv->remove)
drv->remove(pci_dev);
pci_dev->driver = NULL;
}
return 0;
}
The generic driver should be initialized with these methods before it
is registered::
/* initialize common driver fields */
drv->driver.name = drv->name;
drv->driver.bus = &pci_bus_type;
drv->driver.probe = pci_device_probe;
drv->driver.resume = pci_device_resume;
drv->driver.suspend = pci_device_suspend;
drv->driver.remove = pci_device_remove;
/* register with core */
driver_register(&drv->driver);
Ideally, the bus should only initialize the fields if they are not
already set. This allows the drivers to implement their own generic
methods.
Step 5: Support generic driver binding.
The model assumes that a device or driver can be dynamically
registered with the bus at any time. When registration happens,
devices must be bound to a driver, or drivers must be bound to all
devices that it supports.
A driver typically contains a list of device IDs that it supports. The
bus driver compares these IDs to the IDs of devices registered with it.
The format of the device IDs, and the semantics for comparing them are
bus-specific, so the generic model does attempt to generalize them.
Instead, a bus may supply a method in struct bus_type that does the
comparison::
int (*match)(struct device * dev, struct device_driver * drv);
match should return positive value if the driver supports the device,
and zero otherwise. It may also return error code (for example
-EPROBE_DEFER) if determining that given driver supports the device is
not possible.
When a device is registered, the bus's list of drivers is iterated
over. bus->match() is called for each one until a match is found.
When a driver is registered, the bus's list of devices is iterated
over. bus->match() is called for each device that is not already
claimed by a driver.
When a device is successfully bound to a driver, device->driver is
set, the device is added to a per-driver list of devices, and a
symlink is created in the driver's sysfs directory that points to the
device's physical directory::
/sys/bus/pci/drivers/
|-- 3c59x
| `-- 00:0b.0 -> ../../../../devices/pci0/00:0b.0
|-- Ensoniq AudioPCI
|-- agpgart-amdk7
| `-- 00:00.0 -> ../../../../devices/pci0/00:00.0
|-- e100
| `-- 00:0c.0 -> ../../../../devices/pci0/00:0c.0
`-- serial
This driver binding should replace the existing driver binding
mechanism the bus currently uses.
Step 6: Supply a hotplug callback.
Whenever a device is registered with the driver model core, the
userspace program /sbin/hotplug is called to notify userspace.
Users can define actions to perform when a device is inserted or
removed.
The driver model core passes several arguments to userspace via
environment variables, including
- ACTION: set to 'add' or 'remove'
- DEVPATH: set to the device's physical path in sysfs.
A bus driver may also supply additional parameters for userspace to
consume. To do this, a bus must implement the 'hotplug' method in
struct bus_type::
int (*hotplug) (struct device *dev, char **envp,
int num_envp, char *buffer, int buffer_size);
This is called immediately before /sbin/hotplug is executed.
Step 7: Cleaning up the bus driver.
The generic bus, device, and driver structures provide several fields
that can replace those defined privately to the bus driver.
- Device list.
struct bus_type contains a list of all devices registered with the bus
type. This includes all devices on all instances of that bus type.
An internal list that the bus uses may be removed, in favor of using
this one.
The core provides an iterator to access these devices::
int bus_for_each_dev(struct bus_type * bus, struct device * start,
void * data, int (*fn)(struct device *, void *));
- Driver list.
struct bus_type also contains a list of all drivers registered with
it. An internal list of drivers that the bus driver maintains may
be removed in favor of using the generic one.
The drivers may be iterated over, like devices::
int bus_for_each_drv(struct bus_type * bus, struct device_driver * start,
void * data, int (*fn)(struct device_driver *, void *));
Please see drivers/base/bus.c for more information.
- rwsem
struct bus_type contains an rwsem that protects all core accesses to
the device and driver lists. This can be used by the bus driver
internally, and should be used when accessing the device or driver
lists the bus maintains.
- Device and driver fields.
Some of the fields in struct device and struct device_driver duplicate
fields in the bus-specific representations of these objects. Feel free
to remove the bus-specific ones and favor the generic ones. Note
though, that this will likely mean fixing up all the drivers that
reference the bus-specific fields (though those should all be 1-line
changes).

View File

@@ -0,0 +1,119 @@
=======================
initramfs buffer format
=======================
Al Viro, H. Peter Anvin
Last revision: 2002-01-13
Starting with kernel 2.5.x, the old "initial ramdisk" protocol is
getting {replaced/complemented} with the new "initial ramfs"
(initramfs) protocol. The initramfs contents is passed using the same
memory buffer protocol used by the initrd protocol, but the contents
is different. The initramfs buffer contains an archive which is
expanded into a ramfs filesystem; this document details the format of
the initramfs buffer format.
The initramfs buffer format is based around the "newc" or "crc" CPIO
formats, and can be created with the cpio(1) utility. The cpio
archive can be compressed using gzip(1). One valid version of an
initramfs buffer is thus a single .cpio.gz file.
The full format of the initramfs buffer is defined by the following
grammar, where::
* is used to indicate "0 or more occurrences of"
(|) indicates alternatives
+ indicates concatenation
GZIP() indicates the gzip(1) of the operand
ALGN(n) means padding with null bytes to an n-byte boundary
initramfs := ("\0" | cpio_archive | cpio_gzip_archive)*
cpio_gzip_archive := GZIP(cpio_archive)
cpio_archive := cpio_file* + (<nothing> | cpio_trailer)
cpio_file := ALGN(4) + cpio_header + filename + "\0" + ALGN(4) + data
cpio_trailer := ALGN(4) + cpio_header + "TRAILER!!!\0" + ALGN(4)
In human terms, the initramfs buffer contains a collection of
compressed and/or uncompressed cpio archives (in the "newc" or "crc"
formats); arbitrary amounts zero bytes (for padding) can be added
between members.
The cpio "TRAILER!!!" entry (cpio end-of-archive) is optional, but is
not ignored; see "handling of hard links" below.
The structure of the cpio_header is as follows (all fields contain
hexadecimal ASCII numbers fully padded with '0' on the left to the
full width of the field, for example, the integer 4780 is represented
by the ASCII string "000012ac"):
============= ================== ==============================================
Field name Field size Meaning
============= ================== ==============================================
c_magic 6 bytes The string "070701" or "070702"
c_ino 8 bytes File inode number
c_mode 8 bytes File mode and permissions
c_uid 8 bytes File uid
c_gid 8 bytes File gid
c_nlink 8 bytes Number of links
c_mtime 8 bytes Modification time
c_filesize 8 bytes Size of data field
c_maj 8 bytes Major part of file device number
c_min 8 bytes Minor part of file device number
c_rmaj 8 bytes Major part of device node reference
c_rmin 8 bytes Minor part of device node reference
c_namesize 8 bytes Length of filename, including final \0
c_chksum 8 bytes Checksum of data field if c_magic is 070702;
otherwise zero
============= ================== ==============================================
The c_mode field matches the contents of st_mode returned by stat(2)
on Linux, and encodes the file type and file permissions.
The c_filesize should be zero for any file which is not a regular file
or symlink.
The c_chksum field contains a simple 32-bit unsigned sum of all the
bytes in the data field. cpio(1) refers to this as "crc", which is
clearly incorrect (a cyclic redundancy check is a different and
significantly stronger integrity check), however, this is the
algorithm used.
If the filename is "TRAILER!!!" this is actually an end-of-archive
marker; the c_filesize for an end-of-archive marker must be zero.
Handling of hard links
======================
When a nondirectory with c_nlink > 1 is seen, the (c_maj,c_min,c_ino)
tuple is looked up in a tuple buffer. If not found, it is entered in
the tuple buffer and the entry is created as usual; if found, a hard
link rather than a second copy of the file is created. It is not
necessary (but permitted) to include a second copy of the file
contents; if the file contents is not included, the c_filesize field
should be set to zero to indicate no data section follows. If data is
present, the previous instance of the file is overwritten; this allows
the data-carrying instance of a file to occur anywhere in the sequence
(GNU cpio is reported to attach the data to the last instance of a
file only.)
c_filesize must not be zero for a symlink.
When a "TRAILER!!!" end-of-archive marker is seen, the tuple buffer is
reset. This permits archives which are generated independently to be
concatenated.
To combine file data from different sources (without having to
regenerate the (c_maj,c_min,c_ino) fields), therefore, either one of
the following techniques can be used:
a) Separate the different file data sources with a "TRAILER!!!"
end-of-archive marker, or
b) Make sure c_nlink == 1 for all nondirectory entries.

View File

@@ -0,0 +1,154 @@
=======================
Early userspace support
=======================
Last update: 2004-12-20 tlh
"Early userspace" is a set of libraries and programs that provide
various pieces of functionality that are important enough to be
available while a Linux kernel is coming up, but that don't need to be
run inside the kernel itself.
It consists of several major infrastructure components:
- gen_init_cpio, a program that builds a cpio-format archive
containing a root filesystem image. This archive is compressed, and
the compressed image is linked into the kernel image.
- initramfs, a chunk of code that unpacks the compressed cpio image
midway through the kernel boot process.
- klibc, a userspace C library, currently packaged separately, that is
optimized for correctness and small size.
The cpio file format used by initramfs is the "newc" (aka "cpio -H newc")
format, and is documented in the file "buffer-format.txt". There are
two ways to add an early userspace image: specify an existing cpio
archive to be used as the image or have the kernel build process build
the image from specifications.
CPIO ARCHIVE method
-------------------
You can create a cpio archive that contains the early userspace image.
Your cpio archive should be specified in CONFIG_INITRAMFS_SOURCE and it
will be used directly. Only a single cpio file may be specified in
CONFIG_INITRAMFS_SOURCE and directory and file names are not allowed in
combination with a cpio archive.
IMAGE BUILDING method
---------------------
The kernel build process can also build an early userspace image from
source parts rather than supplying a cpio archive. This method provides
a way to create images with root-owned files even though the image was
built by an unprivileged user.
The image is specified as one or more sources in
CONFIG_INITRAMFS_SOURCE. Sources can be either directories or files -
cpio archives are *not* allowed when building from sources.
A source directory will have it and all of its contents packaged. The
specified directory name will be mapped to '/'. When packaging a
directory, limited user and group ID translation can be performed.
INITRAMFS_ROOT_UID can be set to a user ID that needs to be mapped to
user root (0). INITRAMFS_ROOT_GID can be set to a group ID that needs
to be mapped to group root (0).
A source file must be directives in the format required by the
usr/gen_init_cpio utility (run 'usr/gen_init_cpio -h' to get the
file format). The directives in the file will be passed directly to
usr/gen_init_cpio.
When a combination of directories and files are specified then the
initramfs image will be an aggregate of all of them. In this way a user
can create a 'root-image' directory and install all files into it.
Because device-special files cannot be created by a unprivileged user,
special files can be listed in a 'root-files' file. Both 'root-image'
and 'root-files' can be listed in CONFIG_INITRAMFS_SOURCE and a complete
early userspace image can be built by an unprivileged user.
As a technical note, when directories and files are specified, the
entire CONFIG_INITRAMFS_SOURCE is passed to
usr/gen_initramfs_list.sh. This means that CONFIG_INITRAMFS_SOURCE
can really be interpreted as any legal argument to
gen_initramfs_list.sh. If a directory is specified as an argument then
the contents are scanned, uid/gid translation is performed, and
usr/gen_init_cpio file directives are output. If a directory is
specified as an argument to usr/gen_initramfs_list.sh then the
contents of the file are simply copied to the output. All of the output
directives from directory scanning and file contents copying are
processed by usr/gen_init_cpio.
See also 'usr/gen_initramfs_list.sh -h'.
Where's this all leading?
=========================
The klibc distribution contains some of the necessary software to make
early userspace useful. The klibc distribution is currently
maintained separately from the kernel.
You can obtain somewhat infrequent snapshots of klibc from
https://www.kernel.org/pub/linux/libs/klibc/
For active users, you are better off using the klibc git
repository, at http://git.kernel.org/?p=libs/klibc/klibc.git
The standalone klibc distribution currently provides three components,
in addition to the klibc library:
- ipconfig, a program that configures network interfaces. It can
configure them statically, or use DHCP to obtain information
dynamically (aka "IP autoconfiguration").
- nfsmount, a program that can mount an NFS filesystem.
- kinit, the "glue" that uses ipconfig and nfsmount to replace the old
support for IP autoconfig, mount a filesystem over NFS, and continue
system boot using that filesystem as root.
kinit is built as a single statically linked binary to save space.
Eventually, several more chunks of kernel functionality will hopefully
move to early userspace:
- Almost all of init/do_mounts* (the beginning of this is already in
place)
- ACPI table parsing
- Insert unwieldy subsystem that doesn't really need to be in kernel
space here
If kinit doesn't meet your current needs and you've got bytes to burn,
the klibc distribution includes a small Bourne-compatible shell (ash)
and a number of other utilities, so you can replace kinit and build
custom initramfs images that meet your needs exactly.
For questions and help, you can sign up for the early userspace
mailing list at http://www.zytor.com/mailman/listinfo/klibc
How does it work?
=================
The kernel has currently 3 ways to mount the root filesystem:
a) all required device and filesystem drivers compiled into the kernel, no
initrd. init/main.c:init() will call prepare_namespace() to mount the
final root filesystem, based on the root= option and optional init= to run
some other init binary than listed at the end of init/main.c:init().
b) some device and filesystem drivers built as modules and stored in an
initrd. The initrd must contain a binary '/linuxrc' which is supposed to
load these driver modules. It is also possible to mount the final root
filesystem via linuxrc and use the pivot_root syscall. The initrd is
mounted and executed via prepare_namespace().
c) using initramfs. The call to prepare_namespace() must be skipped.
This means that a binary must do all the work. Said binary can be stored
into initramfs either via modifying usr/gen_init_cpio.c or via the new
initrd format, an cpio archive. It must be called "/init". This binary
is responsible to do all the things prepare_namespace() would do.
To maintain backwards compatibility, the /init binary will only run if it
comes via an initramfs cpio archive. If this is not the case,
init/main.c:init() will run prepare_namespace() to mount the final root
and exec one of the predefined init binaries.
Bryan O'Sullivan <bos@serpentine.com>

View File

@@ -0,0 +1,18 @@
.. SPDX-License-Identifier: GPL-2.0
===============
Early Userspace
===============
.. toctree::
:maxdepth: 1
early_userspace_support
buffer-format
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

View File

@@ -0,0 +1,58 @@
.. SPDX-License-Identifier: GPL-2.0
====
EDID
====
In the good old days when graphics parameters were configured explicitly
in a file called xorg.conf, even broken hardware could be managed.
Today, with the advent of Kernel Mode Setting, a graphics board is
either correctly working because all components follow the standards -
or the computer is unusable, because the screen remains dark after
booting or it displays the wrong area. Cases when this happens are:
- The graphics board does not recognize the monitor.
- The graphics board is unable to detect any EDID data.
- The graphics board incorrectly forwards EDID data to the driver.
- The monitor sends no or bogus EDID data.
- A KVM sends its own EDID data instead of querying the connected monitor.
Adding the kernel parameter "nomodeset" helps in most cases, but causes
restrictions later on.
As a remedy for such situations, the kernel configuration item
CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an
individually prepared or corrected EDID data set in the /lib/firmware
directory from where it is loaded via the firmware interface. The code
(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for
commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200,
1680x1050, 1920x1080) as binary blobs, but the kernel source tree does
not contain code to create these data. In order to elucidate the origin
of the built-in binary EDID blobs and to facilitate the creation of
individual data for a specific misbehaving monitor, commented sources
and a Makefile environment are given here.
To create binary EDID and C source code files from the existing data
material, simply type "make".
If you want to create your own EDID file, copy the file 1024x768.S,
replace the settings with your own data and add a new target to the
Makefile. Please note that the EDID data structure expects the timing
values in a different way as compared to the standard X11 format.
X11:
HTimings:
hdisp hsyncstart hsyncend htotal
VTimings:
vdisp vsyncstart vsyncend vtotal
EDID::
#define XPIX hdisp
#define XBLANK htotal-hdisp
#define XOFFSET hsyncstart-hdisp
#define XPULSE hsyncend-hsyncstart
#define YPIX vdisp
#define YBLANK vtotal-vdisp
#define YOFFSET vsyncstart-vdisp
#define YPULSE vsyncend-vsyncstart

View File

@@ -0,0 +1,230 @@
================
EISA bus support
================
:Author: Marc Zyngier <maz@wild-wind.fr.eu.org>
This document groups random notes about porting EISA drivers to the
new EISA/sysfs API.
Starting from version 2.5.59, the EISA bus is almost given the same
status as other much more mainstream busses such as PCI or USB. This
has been possible through sysfs, which defines a nice enough set of
abstractions to manage busses, devices and drivers.
Although the new API is quite simple to use, converting existing
drivers to the new infrastructure is not an easy task (mostly because
detection code is generally also used to probe ISA cards). Moreover,
most EISA drivers are among the oldest Linux drivers so, as you can
imagine, some dust has settled here over the years.
The EISA infrastructure is made up of three parts:
- The bus code implements most of the generic code. It is shared
among all the architectures that the EISA code runs on. It
implements bus probing (detecting EISA cards available on the bus),
allocates I/O resources, allows fancy naming through sysfs, and
offers interfaces for driver to register.
- The bus root driver implements the glue between the bus hardware
and the generic bus code. It is responsible for discovering the
device implementing the bus, and setting it up to be latter probed
by the bus code. This can go from something as simple as reserving
an I/O region on x86, to the rather more complex, like the hppa
EISA code. This is the part to implement in order to have EISA
running on an "new" platform.
- The driver offers the bus a list of devices that it manages, and
implements the necessary callbacks to probe and release devices
whenever told to.
Every function/structure below lives in <linux/eisa.h>, which depends
heavily on <linux/device.h>.
Bus root driver
===============
::
int eisa_root_register (struct eisa_root_device *root);
The eisa_root_register function is used to declare a device as the
root of an EISA bus. The eisa_root_device structure holds a reference
to this device, as well as some parameters for probing purposes::
struct eisa_root_device {
struct device *dev; /* Pointer to bridge device */
struct resource *res;
unsigned long bus_base_addr;
int slots; /* Max slot number */
int force_probe; /* Probe even when no slot 0 */
u64 dma_mask; /* from bridge device */
int bus_nr; /* Set by eisa_root_register */
struct resource eisa_root_res; /* ditto */
};
============= ======================================================
node used for eisa_root_register internal purpose
dev pointer to the root device
res root device I/O resource
bus_base_addr slot 0 address on this bus
slots max slot number to probe
force_probe Probe even when slot 0 is empty (no EISA mainboard)
dma_mask Default DMA mask. Usually the bridge device dma_mask.
bus_nr unique bus id, set by eisa_root_register
============= ======================================================
Driver
======
::
int eisa_driver_register (struct eisa_driver *edrv);
void eisa_driver_unregister (struct eisa_driver *edrv);
Clear enough ?
::
struct eisa_device_id {
char sig[EISA_SIG_LEN];
unsigned long driver_data;
};
struct eisa_driver {
const struct eisa_device_id *id_table;
struct device_driver driver;
};
=============== ====================================================
id_table an array of NULL terminated EISA id strings,
followed by an empty string. Each string can
optionally be paired with a driver-dependent value
(driver_data).
driver a generic driver, such as described in
Documentation/driver-api/driver-model/driver.rst. Only .name,
.probe and .remove members are mandatory.
=============== ====================================================
An example is the 3c59x driver::
static struct eisa_device_id vortex_eisa_ids[] = {
{ "TCM5920", EISA_3C592_OFFSET },
{ "TCM5970", EISA_3C597_OFFSET },
{ "" }
};
static struct eisa_driver vortex_eisa_driver = {
.id_table = vortex_eisa_ids,
.driver = {
.name = "3c59x",
.probe = vortex_eisa_probe,
.remove = vortex_eisa_remove
}
};
Device
======
The sysfs framework calls .probe and .remove functions upon device
discovery and removal (note that the .remove function is only called
when driver is built as a module).
Both functions are passed a pointer to a 'struct device', which is
encapsulated in a 'struct eisa_device' described as follows::
struct eisa_device {
struct eisa_device_id id;
int slot;
int state;
unsigned long base_addr;
struct resource res[EISA_MAX_RESOURCES];
u64 dma_mask;
struct device dev; /* generic device */
};
======== ============================================================
id EISA id, as read from device. id.driver_data is set from the
matching driver EISA id.
slot slot number which the device was detected on
state set of flags indicating the state of the device. Current
flags are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED.
res set of four 256 bytes I/O regions allocated to this device
dma_mask DMA mask set from the parent device.
dev generic device (see Documentation/driver-api/driver-model/device.rst)
======== ============================================================
You can get the 'struct eisa_device' from 'struct device' using the
'to_eisa_device' macro.
Misc stuff
==========
::
void eisa_set_drvdata (struct eisa_device *edev, void *data);
Stores data into the device's driver_data area.
::
void *eisa_get_drvdata (struct eisa_device *edev):
Gets the pointer previously stored into the device's driver_data area.
::
int eisa_get_region_index (void *addr);
Returns the region number (0 <= x < EISA_MAX_RESOURCES) of a given
address.
Kernel parameters
=================
eisa_bus.enable_dev
A comma-separated list of slots to be enabled, even if the firmware
set the card as disabled. The driver must be able to properly
initialize the device in such conditions.
eisa_bus.disable_dev
A comma-separated list of slots to be enabled, even if the firmware
set the card as enabled. The driver won't be called to handle this
device.
virtual_root.force_probe
Force the probing code to probe EISA slots even when it cannot find an
EISA compliant mainboard (nothing appears on slot 0). Defaults to 0
(don't force), and set to 1 (force probing) when either
CONFIG_ALPHA_JENSEN or CONFIG_EISA_VLB_PRIMING are set.
Random notes
============
Converting an EISA driver to the new API mostly involves *deleting*
code (since probing is now in the core EISA code). Unfortunately, most
drivers share their probing routine between ISA, and EISA. Special
care must be taken when ripping out the EISA code, so other busses
won't suffer from these surgical strikes...
You *must not* expect any EISA device to be detected when returning
from eisa_driver_register, since the chances are that the bus has not
yet been probed. In fact, that's what happens most of the time (the
bus root driver usually kicks in rather late in the boot process).
Unfortunately, most drivers are doing the probing by themselves, and
expect to have explored the whole machine when they exit their probe
routine.
For example, switching your favorite EISA SCSI card to the "hotplug"
model is "the right thing"(tm).
Thanks
======
I'd like to thank the following people for their help:
- Xavier Benigni for lending me a wonderful Alpha Jensen,
- James Bottomley, Jeff Garzik for getting this stuff into the kernel,
- Andries Brouwer for contributing numerous EISA ids,
- Catrin Jones for coping with far too many machines at home.

View File

@@ -399,7 +399,7 @@ symbol:
will pass the struct gpio_chip* for the chip to all IRQ callbacks, so the
callbacks need to embed the gpio_chip in its state container and obtain a
pointer to the container using container_of().
(See Documentation/driver-model/design-patterns.rst)
(See Documentation/driver-api/driver-model/design-patterns.rst)
- gpiochip_irqchip_add_nested(): adds a nested cascaded irqchip to a gpiochip,
as discussed above regarding different types of cascaded irqchips. The

View File

@@ -14,8 +14,10 @@ available subsections can be seen below.
.. toctree::
:maxdepth: 2
driver-model/index
basics
infrastructure
early-userspace/index
pm/index
clk
device-io
@@ -36,6 +38,7 @@ available subsections can be seen below.
i2c
ipmb
i3c/index
interconnect
hsi
edac
scsi
@@ -44,8 +47,11 @@ available subsections can be seen below.
mtdnand
miscellaneous
mei/index
mtd/index
mmc/index
nvdimm/index
w1
rapidio
rapidio/index
s390-drivers
vme
80211/index
@@ -53,13 +59,48 @@ available subsections can be seen below.
firmware/index
pinctl
gpio/index
md/index
misc_devices
nfc/index
dmaengine/index
slimbus
soundwire/index
fpga/index
acpi/index
backlight/lp855x-driver.rst
bt8xxgpio
connector
console
dcdbas
dell_rbu
edid
eisa
isa
isapnp
generic-counter
lightnvm-pblk
memory-devices/index
men-chameleon-bus
ntb
nvmem
parport-lowlevel
pps
ptp
phy/index
pti_intel_mid
pwm
rfkill
serial/index
sgi-ioc4
sm501
smsc_ece1099
switchtec
sync_file
vfio-mediated-device
vfio
xilinx/index
xillybus
zorro
.. only:: subproject and html

View File

@@ -0,0 +1,93 @@
.. SPDX-License-Identifier: GPL-2.0
=====================================
GENERIC SYSTEM INTERCONNECT SUBSYSTEM
=====================================
Introduction
------------
This framework is designed to provide a standard kernel interface to control
the settings of the interconnects on an SoC. These settings can be throughput,
latency and priority between multiple interconnected devices or functional
blocks. This can be controlled dynamically in order to save power or provide
maximum performance.
The interconnect bus is hardware with configurable parameters, which can be
set on a data path according to the requests received from various drivers.
An example of interconnect buses are the interconnects between various
components or functional blocks in chipsets. There can be multiple interconnects
on an SoC that can be multi-tiered.
Below is a simplified diagram of a real-world SoC interconnect bus topology.
::
+----------------+ +----------------+
| HW Accelerator |--->| M NoC |<---------------+
+----------------+ +----------------+ |
| | +------------+
+-----+ +-------------+ V +------+ | |
| DDR | | +--------+ | PCIe | | |
+-----+ | | Slaves | +------+ | |
^ ^ | +--------+ | | C NoC |
| | V V | |
+------------------+ +------------------------+ | | +-----+
| |-->| |-->| |-->| CPU |
| |-->| |<--| | +-----+
| Mem NoC | | S NoC | +------------+
| |<--| |---------+ |
| |<--| |<------+ | | +--------+
+------------------+ +------------------------+ | | +-->| Slaves |
^ ^ ^ ^ ^ | | +--------+
| | | | | | V
+------+ | +-----+ +-----+ +---------+ +----------------+ +--------+
| CPUs | | | GPU | | DSP | | Masters |-->| P NoC |-->| Slaves |
+------+ | +-----+ +-----+ +---------+ +----------------+ +--------+
|
+-------+
| Modem |
+-------+
Terminology
-----------
Interconnect provider is the software definition of the interconnect hardware.
The interconnect providers on the above diagram are M NoC, S NoC, C NoC, P NoC
and Mem NoC.
Interconnect node is the software definition of the interconnect hardware
port. Each interconnect provider consists of multiple interconnect nodes,
which are connected to other SoC components including other interconnect
providers. The point on the diagram where the CPUs connect to the memory is
called an interconnect node, which belongs to the Mem NoC interconnect provider.
Interconnect endpoints are the first or the last element of the path. Every
endpoint is a node, but not every node is an endpoint.
Interconnect path is everything between two endpoints including all the nodes
that have to be traversed to reach from a source to destination node. It may
include multiple master-slave pairs across several interconnect providers.
Interconnect consumers are the entities which make use of the data paths exposed
by the providers. The consumers send requests to providers requesting various
throughput, latency and priority. Usually the consumers are device drivers, that
send request based on their needs. An example for a consumer is a video decoder
that supports various formats and image sizes.
Interconnect providers
----------------------
Interconnect provider is an entity that implements methods to initialize and
configure interconnect bus hardware. The interconnect provider drivers should
be registered with the interconnect provider core.
.. kernel-doc:: include/linux/interconnect-provider.h
Interconnect consumers
----------------------
Interconnect consumers are the clients which use the interconnect APIs to
get paths between endpoints and set their bandwidth/latency/QoS requirements
for these interconnect paths. These interfaces are not currently
documented.

View File

@@ -0,0 +1,122 @@
===========
ISA Drivers
===========
The following text is adapted from the commit message of the initial
commit of the ISA bus driver authored by Rene Herman.
During the recent "isa drivers using platform devices" discussion it was
pointed out that (ALSA) ISA drivers ran into the problem of not having
the option to fail driver load (device registration rather) upon not
finding their hardware due to a probe() error not being passed up
through the driver model. In the course of that, I suggested a separate
ISA bus might be best; Russell King agreed and suggested this bus could
use the .match() method for the actual device discovery.
The attached does this. For this old non (generically) discoverable ISA
hardware only the driver itself can do discovery so as a difference with
the platform_bus, this isa_bus also distributes match() up to the
driver.
As another difference: these devices only exist in the driver model due
to the driver creating them because it might want to drive them, meaning
that all device creation has been made internal as well.
The usage model this provides is nice, and has been acked from the ALSA
side by Takashi Iwai and Jaroslav Kysela. The ALSA driver module_init's
now (for oldisa-only drivers) become::
static int __init alsa_card_foo_init(void)
{
return isa_register_driver(&snd_foo_isa_driver, SNDRV_CARDS);
}
static void __exit alsa_card_foo_exit(void)
{
isa_unregister_driver(&snd_foo_isa_driver);
}
Quite like the other bus models therefore. This removes a lot of
duplicated init code from the ALSA ISA drivers.
The passed in isa_driver struct is the regular driver struct embedding a
struct device_driver, the normal probe/remove/shutdown/suspend/resume
callbacks, and as indicated that .match callback.
The "SNDRV_CARDS" you see being passed in is a "unsigned int ndev"
parameter, indicating how many devices to create and call our methods
with.
The platform_driver callbacks are called with a platform_device param;
the isa_driver callbacks are being called with a ``struct device *dev,
unsigned int id`` pair directly -- with the device creation completely
internal to the bus it's much cleaner to not leak isa_dev's by passing
them in at all. The id is the only thing we ever want other then the
struct device anyways, and it makes for nicer code in the callbacks as
well.
With this additional .match() callback ISA drivers have all options. If
ALSA would want to keep the old non-load behaviour, it could stick all
of the old .probe in .match, which would only keep them registered after
everything was found to be present and accounted for. If it wanted the
behaviour of always loading as it inadvertently did for a bit after the
changeover to platform devices, it could just not provide a .match() and
do everything in .probe() as before.
If it, as Takashi Iwai already suggested earlier as a way of following
the model from saner buses more closely, wants to load when a later bind
could conceivably succeed, it could use .match() for the prerequisites
(such as checking the user wants the card enabled and that port/irq/dma
values have been passed in) and .probe() for everything else. This is
the nicest model.
To the code...
This exports only two functions; isa_{,un}register_driver().
isa_register_driver() register's the struct device_driver, and then
loops over the passed in ndev creating devices and registering them.
This causes the bus match method to be called for them, which is::
int isa_bus_match(struct device *dev, struct device_driver *driver)
{
struct isa_driver *isa_driver = to_isa_driver(driver);
if (dev->platform_data == isa_driver) {
if (!isa_driver->match ||
isa_driver->match(dev, to_isa_dev(dev)->id))
return 1;
dev->platform_data = NULL;
}
return 0;
}
The first thing this does is check if this device is in fact one of this
driver's devices by seeing if the device's platform_data pointer is set
to this driver. Platform devices compare strings, but we don't need to
do that with everything being internal, so isa_register_driver() abuses
dev->platform_data as a isa_driver pointer which we can then check here.
I believe platform_data is available for this, but if rather not, moving
the isa_driver pointer to the private struct isa_dev is ofcourse fine as
well.
Then, if the the driver did not provide a .match, it matches. If it did,
the driver match() method is called to determine a match.
If it did **not** match, dev->platform_data is reset to indicate this to
isa_register_driver which can then unregister the device again.
If during all this, there's any error, or no devices matched at all
everything is backed out again and the error, or -ENODEV, is returned.
isa_unregister_driver() just unregisters the matched devices and the
driver itself.
module_isa_driver is a helper macro for ISA drivers which do not do
anything special in module init/exit. This eliminates a lot of
boilerplate code. Each module may only use this macro once, and calling
it replaces module_init and module_exit.
max_num_isa_dev is a macro to determine the maximum possible number of
ISA devices which may be registered in the I/O port address space given
the address extent of the ISA devices.

View File

@@ -0,0 +1,15 @@
==========================================================
ISA Plug & Play support by Jaroslav Kysela <perex@suse.cz>
==========================================================
Interface /proc/isapnp
======================
The interface has been removed. See pnp.txt for more details.
Interface /proc/bus/isapnp
==========================
This directory allows access to ISA PnP cards and logical devices.
The regular files contain the contents of ISA PnP registers for
a logical device.

View File

@@ -0,0 +1,21 @@
pblk: Physical Block Device Target
==================================
pblk implements a fully associative, host-based FTL that exposes a traditional
block I/O interface. Its primary responsibilities are:
- Map logical addresses onto physical addresses (4KB granularity) in a
logical-to-physical (L2P) table.
- Maintain the integrity and consistency of the L2P table as well as its
recovery from normal tear down and power outage.
- Deal with controller- and media-specific constrains.
- Handle I/O errors.
- Implement garbage collection.
- Maintain consistency across the I/O stack during synchronization points.
For more information please refer to:
http://lightnvm.io
which maintains updated FAQs, manual pages, technical documentation, tools,
contacts, etc.

View File

@@ -0,0 +1,12 @@
.. SPDX-License-Identifier: GPL-2.0
====
RAID
====
.. toctree::
:maxdepth: 1
md-cluster
raid5-cache
raid5-ppl

View File

@@ -0,0 +1,385 @@
==========
MD Cluster
==========
The cluster MD is a shared-device RAID for a cluster, it supports
two levels: raid1 and raid10 (limited support).
1. On-disk format
=================
Separate write-intent-bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is::
0 4k 8k 12k
-------------------------------------------------------------------
| idle | md super | bm super [0] + bits |
| bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] |
| bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits |
| bm bits [3, contd] | | |
During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will
- set the appropriate bit (if not already set)
- commit the write to all mirrors
- schedule the bit to be cleared after a timeout.
Reads are just handled normally. It is up to the filesystem to ensure
one node doesn't read from a location where another node (or the same
node) is writing.
2. DLM Locks for management
===========================
There are three groups of locks for managing the device:
2.1 Bitmap lock resource (bm_lockres)
-------------------------------------
The bm_lockres protects individual node bitmaps. They are named in
the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
node joins the cluster, it acquires the lock in PW mode and it stays
so during the lifetime the node is part of the cluster. The lock
resource number is based on the slot number returned by the DLM
subsystem. Since DLM starts node count from one and bitmap slots
start from zero, one is subtracted from the DLM slot number to arrive
at the bitmap slot number.
The LVB of the bitmap lock for a particular node records the range
of sectors that are being re-synced by that node. No other
node may write to those sectors. This is used when a new nodes
joins the cluster.
2.2 Message passing locks
-------------------------
Each node has to communicate with other nodes when starting or ending
resync, and for metadata superblock updates. This communication is
managed through three locks: "token", "message", and "ack", together
with the Lock Value Block (LVB) of one of the "message" lock.
2.3 new-device management
-------------------------
A single lock: "no-new-dev" is used to co-ordinate the addition of
new devices - this must be synchronized across the array.
Normally all nodes hold a concurrent-read lock on this device.
3. Communication
================
Messages can be broadcast to all nodes, and the sender waits for all
other nodes to acknowledge the message before proceeding. Only one
message can be processed at a time.
3.1 Message Types
-----------------
There are six types of messages which are passed:
3.1.1 METADATA_UPDATED
^^^^^^^^^^^^^^^^^^^^^^
informs other nodes that the metadata has
been updated, and the node must re-read the md superblock. This is
performed synchronously. It is primarily used to signal device
failure.
3.1.2 RESYNCING
^^^^^^^^^^^^^^^
informs other nodes that a resync is initiated or
ended so that each node may suspend or resume the region. Each
RESYNCING message identifies a range of the devices that the
sending node is about to resync. This overrides any previous
notification from that node: only one ranged can be resynced at a
time per-node.
3.1.3 NEWDISK
^^^^^^^^^^^^^
informs other nodes that a device is being added to
the array. Message contains an identifier for that device. See
below for further details.
3.1.4 REMOVE
^^^^^^^^^^^^
A failed or spare device is being removed from the
array. The slot-number of the device is included in the message.
3.1.5 RE_ADD:
A failed device is being re-activated - the assumption
is that it has been determined to be working again.
3.1.6 BITMAP_NEEDS_SYNC:
If a node is stopped locally but the bitmap
isn't clean, then another node is informed to take the ownership of
resync.
3.2 Communication mechanism
---------------------------
The DLM LVB is used to communicate within nodes of the cluster. There
are three resources used for the purpose:
3.2.1 token
^^^^^^^^^^^
The resource which protects the entire communication
system. The node having the token resource is allowed to
communicate.
3.2.2 message
^^^^^^^^^^^^^
The lock resource which carries the data to communicate.
3.2.3 ack
^^^^^^^^^
The resource, acquiring which means the message has been
acknowledged by all nodes in the cluster. The BAST of the resource
is used to inform the receiving node that a node wants to
communicate.
The algorithm is:
1. receive status - all nodes have concurrent-reader lock on "ack"::
sender receiver receiver
"ack":CR "ack":CR "ack":CR
2. sender get EX on "token",
sender get EX on "message"::
sender receiver receiver
"token":EX "ack":CR "ack":CR
"message":EX
"ack":CR
Sender checks that it still needs to send a message. Messages
received or other events that happened while waiting for the
"token" may have made this message inappropriate or redundant.
3. sender writes LVB
sender down-convert "message" from EX to CW
sender try to get EX of "ack"
::
[ wait until all receivers have *processed* the "message" ]
[ triggered by bast of "ack" ]
receiver get CR on "message"
receiver read LVB
receiver processes the message
[ wait finish ]
receiver releases "ack"
receiver tries to get PR on "message"
sender receiver receiver
"token":EX "message":CR "message":CR
"message":CW
"ack":EX
4. triggered by grant of EX on "ack" (indicating all receivers
have processed message)
sender down-converts "ack" from EX to CR
sender releases "message"
sender releases "token"
::
receiver upconvert to PR on "message"
receiver get CR of "ack"
receiver release "message"
sender receiver receiver
"ack":CR "ack":CR "ack":CR
4. Handling Failures
====================
4.1 Node Failure
----------------
When a node fails, the DLM informs the cluster with the slot
number. The node starts a cluster recovery thread. The cluster
recovery thread:
- acquires the bitmap<number> lock of the failed node
- opens the bitmap
- reads the bitmap of the failed node
- copies the set bitmap to local node
- cleans the bitmap of the failed node
- releases bitmap<number> lock of the failed node
- initiates resync of the bitmap on the current node
md_check_recovery is invoked within recover_bitmaps,
then md_check_recovery -> metadata_update_start/finish,
it will lock the communication by lock_comm.
Which means when one node is resyncing it blocks all
other nodes from writing anywhere on the array.
The resync process is the regular md resync. However, in a clustered
environment when a resync is performed, it needs to tell other nodes
of the areas which are suspended. Before a resync starts, the node
send out RESYNCING with the (lo,hi) range of the area which needs to
be suspended. Each node maintains a suspend_list, which contains the
list of ranges which are currently suspended. On receiving RESYNCING,
the node adds the range to the suspend_list. Similarly, when the node
performing resync finishes, it sends RESYNCING with an empty range to
other nodes and other nodes remove the corresponding entry from the
suspend_list.
A helper function, ->area_resyncing() can be used to check if a
particular I/O range should be suspended or not.
4.2 Device Failure
==================
Device failures are handled and communicated with the metadata update
routine. When a node detects a device failure it does not allow
any further writes to that device until the failure has been
acknowledged by all other nodes.
5. Adding a new Device
----------------------
For adding a new device, it is necessary that all nodes "see" the new
device to be added. For this, the following algorithm is used:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends a NEWDISK message with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
(Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether
the disk was found:
ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
disc.number set to slot number)
ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on "no-new-devs" (CR) if device is found
7. Node 1 attempts EX lock on "no-new-dev"
8. If node 1 gets the lock, it sends METADATA_UPDATED after
unmarking the disk as SpareLocal
9. If not (get "no-new-dev" lock), it fails the operation and sends
METADATA_UPDATED.
10. Other nodes get the information whether a disk is added or not
by the following METADATA_UPDATED.
6. Module interface
===================
There are 17 call-backs which the md core can make to the cluster
module. Understanding these can give a good overview of the whole
process.
6.1 join(nodes) and leave()
---------------------------
These are called when an array is started with a clustered bitmap,
and when the array is stopped. join() ensures the cluster is
available and initializes the various resources.
Only the first 'nodes' nodes in the cluster can use the array.
6.2 slot_number()
-----------------
Reports the slot number advised by the cluster infrastructure.
Range is from 0 to nodes-1.
6.3 resync_info_update()
------------------------
This updates the resync range that is stored in the bitmap lock.
The starting point is updated as the resync progresses. The
end point is always the end of the array.
It does *not* send a RESYNCING message.
6.4 resync_start(), resync_finish()
-----------------------------------
These are called when resync/recovery/reshape starts or stops.
They update the resyncing range in the bitmap lock and also
send a RESYNCING message. resync_start reports the whole
array as resyncing, resync_finish reports none of it.
resync_finish() also sends a BITMAP_NEEDS_SYNC message which
allows some other node to take over.
6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
-------------------------------------------------------------------------------
metadata_update_start is used to get exclusive access to
the metadata. If a change is still needed once that access is
gained, metadata_update_finish() will send a METADATA_UPDATE
message to all other nodes, otherwise metadata_update_cancel()
can be used to release the lock.
6.6 area_resyncing()
--------------------
This combines two elements of functionality.
Firstly, it will check if any node is currently resyncing
anything in a given range of sectors. If any resync is found,
then the caller will avoid writing or read-balancing in that
range.
Secondly, while node recovery is happening it reports that
all areas are resyncing for READ requests. This avoids races
between the cluster-filesystem and the cluster-RAID handling
a node failure.
6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
---------------------------------------------------------------
These are used to manage the new-disk protocol described above.
When a new device is added, add_new_disk_start() is called before
it is bound to the array and, if that succeeds, add_new_disk_finish()
is called the device is fully added.
When a device is added in acknowledgement to a previous
request, or when the device is declared "unavailable",
new_disk_ack() is called.
6.8 remove_disk()
-----------------
This is called when a spare or failed device is removed from
the array. It causes a REMOVE message to be send to other nodes.
6.9 gather_bitmaps()
--------------------
This sends a RE_ADD message to all other nodes and then
gathers bitmap information from all bitmaps. This combined
bitmap is then used to recovery the re-added device.
6.10 lock_all_bitmaps() and unlock_all_bitmaps()
------------------------------------------------
These are called when change bitmap to none. If a node plans
to clear the cluster raid's bitmap, it need to make sure no other
nodes are using the raid which is achieved by lock all bitmap
locks within the cluster, and also those locks are unlocked
accordingly.
7. Unsupported features
=======================
There are somethings which are not supported by cluster MD yet.
- change array_sectors.

View File

@@ -0,0 +1,111 @@
================
RAID 4/5/6 cache
================
Raid 4/5/6 could include an extra disk for data cache besides normal RAID
disks. The role of RAID disks isn't changed with the cache disk. The cache disk
caches data to the RAID disks. The cache can be in write-through (supported
since 4.4) or write-back mode (supported since 4.10). mdadm (supported since
3.4) has a new option '--write-journal' to create array with cache. Please
refer to mdadm manual for details. By default (RAID array starts), the cache is
in write-through mode. A user can switch it to write-back mode by::
echo "write-back" > /sys/block/md0/md/journal_mode
And switch it back to write-through mode by::
echo "write-through" > /sys/block/md0/md/journal_mode
In both modes, all writes to the array will hit cache disk first. This means
the cache disk must be fast and sustainable.
write-through mode
==================
This mode mainly fixes the 'write hole' issue. For RAID 4/5/6 array, an unclean
shutdown can cause data in some stripes to not be in consistent state, eg, data
and parity don't match. The reason is that a stripe write involves several RAID
disks and it's possible the writes don't hit all RAID disks yet before the
unclean shutdown. We call an array degraded if it has inconsistent data. MD
tries to resync the array to bring it back to normal state. But before the
resync completes, any system crash will expose the chance of real data
corruption in the RAID array. This problem is called 'write hole'.
The write-through cache will cache all data on cache disk first. After the data
is safe on the cache disk, the data will be flushed onto RAID disks. The
two-step write will guarantee MD can recover correct data after unclean
shutdown even the array is degraded. Thus the cache can close the 'write hole'.
In write-through mode, MD reports IO completion to upper layer (usually
filesystems) after the data is safe on RAID disks, so cache disk failure
doesn't cause data loss. Of course cache disk failure means the array is
exposed to 'write hole' again.
In write-through mode, the cache disk isn't required to be big. Several
hundreds megabytes are enough.
write-back mode
===============
write-back mode fixes the 'write hole' issue too, since all write data is
cached on cache disk. But the main goal of 'write-back' cache is to speed up
write. If a write crosses all RAID disks of a stripe, we call it full-stripe
write. For non-full-stripe writes, MD must read old data before the new parity
can be calculated. These synchronous reads hurt write throughput. Some writes
which are sequential but not dispatched in the same time will suffer from this
overhead too. Write-back cache will aggregate the data and flush the data to
RAID disks only after the data becomes a full stripe write. This will
completely avoid the overhead, so it's very helpful for some workloads. A
typical workload which does sequential write followed by fsync is an example.
In write-back mode, MD reports IO completion to upper layer (usually
filesystems) right after the data hits cache disk. The data is flushed to raid
disks later after specific conditions met. So cache disk failure will cause
data loss.
In write-back mode, MD also caches data in memory. The memory cache includes
the same data stored on cache disk, so a power loss doesn't cause data loss.
The memory cache size has performance impact for the array. It's recommended
the size is big. A user can configure the size by::
echo "2048" > /sys/block/md0/md/stripe_cache_size
Too small cache disk will make the write aggregation less efficient in this
mode depending on the workloads. It's recommended to use a cache disk with at
least several gigabytes size in write-back mode.
The implementation
==================
The write-through and write-back cache use the same disk format. The cache disk
is organized as a simple write log. The log consists of 'meta data' and 'data'
pairs. The meta data describes the data. It also includes checksum and sequence
ID for recovery identification. Data can be IO data and parity data. Data is
checksumed too. The checksum is stored in the meta data ahead of the data. The
checksum is an optimization because MD can write meta and data freely without
worry about the order. MD superblock has a field pointed to the valid meta data
of log head.
The log implementation is pretty straightforward. The difficult part is the
order in which MD writes data to cache disk and RAID disks. Specifically, in
write-through mode, MD calculates parity for IO data, writes both IO data and
parity to the log, writes the data and parity to RAID disks after the data and
parity is settled down in log and finally the IO is finished. Read just reads
from raid disks as usual.
In write-back mode, MD writes IO data to the log and reports IO completion. The
data is also fully cached in memory at that time, which means read must query
memory cache. If some conditions are met, MD will flush the data to RAID disks.
MD will calculate parity for the data and write parity into the log. After this
is finished, MD will write both data and parity into RAID disks, then MD can
release the memory cache. The flush conditions could be stripe becomes a full
stripe write, free cache disk space is low or free in-kernel memory cache space
is low.
After an unclean shutdown, MD does recovery. MD reads all meta data and data
from the log. The sequence ID and checksum will help us detect corrupted meta
data and data. If MD finds a stripe with data and valid parities (1 parity for
raid4/5 and 2 for raid6), MD will write the data and parities to RAID disks. If
parities are incompleted, they are discarded. If part of data is corrupted,
they are discarded too. MD then loads valid data and writes them to RAID disks
in normal way.

View File

@@ -0,0 +1,47 @@
==================
Partial Parity Log
==================
Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
addressed by PPL is that after a dirty shutdown, parity of a particular stripe
may become inconsistent with data on other member disks. If the array is also
in degraded state, there is no way to recalculate parity, because one of the
disks is missing. This can lead to silent data corruption when rebuilding the
array or using it is as degraded - data calculated from parity for array blocks
that have not been touched by a write request during the unclean shutdown can
be incorrect. Such condition is known as the RAID5 Write Hole. Because of
this, md by default does not allow starting a dirty degraded array.
Partial parity for a write operation is the XOR of stripe data chunks not
modified by this write. It is just enough data needed for recovering from the
write hole. XORing partial parity with the modified chunks produces parity for
the stripe, consistent with its state before the write operation, regardless of
which chunk writes have completed. If one of the not modified data disks of
this stripe is missing, this updated parity can be used to recover its
contents. PPL recovery is also performed when starting an array after an
unclean shutdown and all disks are available, eliminating the need to resync
the array. Because of this, using write-intent bitmap and PPL together is not
supported.
When handling a write request PPL writes partial parity before new data and
parity are dispatched to disks. PPL is a distributed log - it is stored on
array member drives in the metadata area, on the parity drive of a particular
stripe. It does not require a dedicated journaling drive. Write performance is
reduced by up to 30%-40% but it scales with the number of drives in the array
and the journaling drive does not become a bottleneck or a single point of
failure.
Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
not a true journal. It does not protect from losing in-flight data, only from
silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
performed for this stripe (parity is not updated). So it is possible to have
arbitrary data in the written part of a stripe if that disk is lost. In such
case the behavior is the same as in plain raid5.
PPL is available for md version-1 metadata and external (specifically IMSM)
metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.
There is a limitation of maximum 64 disks in the array for PPL. It allows to
keep data structures and implementation simple. RAID5 arrays with so many disks
are not likely due to high risk of multiple disks failure. Such restriction
should not be a real life limitation.

View File

@@ -0,0 +1,18 @@
.. SPDX-License-Identifier: GPL-2.0
=========================
Memory Controller drivers
=========================
.. toctree::
:maxdepth: 1
ti-emif
ti-gpmc
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

View File

@@ -0,0 +1,64 @@
.. SPDX-License-Identifier: GPL-2.0
===============================
TI EMIF SDRAM Controller Driver
===============================
Author
======
Aneesh V <aneesh@ti.com>
Location
========
driver/memory/emif.c
Supported SoCs:
===============
TI OMAP44xx
TI OMAP54xx
Menuconfig option:
==================
Device Drivers
Memory devices
Texas Instruments EMIF driver
Description
===========
This driver is for the EMIF module available in Texas Instruments
SoCs. EMIF is an SDRAM controller that, based on its revision,
supports one or more of DDR2, DDR3, and LPDDR2 SDRAM protocols.
This driver takes care of only LPDDR2 memories presently. The
functions of the driver includes re-configuring AC timing
parameters and other settings during frequency, voltage and
temperature changes
Platform Data (see include/linux/platform_data/emif_plat.h)
===========================================================
DDR device details and other board dependent and SoC dependent
information can be passed through platform data (struct emif_platform_data)
- DDR device details: 'struct ddr_device_info'
- Device AC timings: 'struct lpddr2_timings' and 'struct lpddr2_min_tck'
- Custom configurations: customizable policy options through
'struct emif_custom_configs'
- IP revision
- PHY type
Interface to the external world
===============================
EMIF driver registers notifiers for voltage and frequency changes
affecting EMIF and takes appropriate actions when these are invoked.
- freq_pre_notify_handling()
- freq_post_notify_handling()
- volt_notify_handling()
Debugfs
=======
The driver creates two debugfs entries per device.
- regcache_dump : dump of register values calculated and saved for all
frequencies used so far.
- mr4 : last polled value of MR4 register in the LPDDR2 device. MR4
indicates the current temperature level of the device.

View File

@@ -0,0 +1,179 @@
.. SPDX-License-Identifier: GPL-2.0
========================================
GPMC (General Purpose Memory Controller)
========================================
GPMC is an unified memory controller dedicated to interfacing external
memory devices like
* Asynchronous SRAM like memories and application specific integrated
circuit devices.
* Asynchronous, synchronous, and page mode burst NOR flash devices
NAND flash
* Pseudo-SRAM devices
GPMC is found on Texas Instruments SoC's (OMAP based)
IP details: http://www.ti.com/lit/pdf/spruh73 section 7.1
GPMC generic timing calculation:
================================
GPMC has certain timings that has to be programmed for proper
functioning of the peripheral, while peripheral has another set of
timings. To have peripheral work with gpmc, peripheral timings has to
be translated to the form gpmc can understand. The way it has to be
translated depends on the connected peripheral. Also there is a
dependency for certain gpmc timings on gpmc clock frequency. Hence a
generic timing routine was developed to achieve above requirements.
Generic routine provides a generic method to calculate gpmc timings
from gpmc peripheral timings. struct gpmc_device_timings fields has to
be updated with timings from the datasheet of the peripheral that is
connected to gpmc. A few of the peripheral timings can be fed either
in time or in cycles, provision to handle this scenario has been
provided (refer struct gpmc_device_timings definition). It may so
happen that timing as specified by peripheral datasheet is not present
in timing structure, in this scenario, try to correlate peripheral
timing to the one available. If that doesn't work, try to add a new
field as required by peripheral, educate generic timing routine to
handle it, make sure that it does not break any of the existing.
Then there may be cases where peripheral datasheet doesn't mention
certain fields of struct gpmc_device_timings, zero those entries.
Generic timing routine has been verified to work properly on
multiple onenand's and tusb6010 peripherals.
A word of caution: generic timing routine has been developed based
on understanding of gpmc timings, peripheral timings, available
custom timing routines, a kind of reverse engineering without
most of the datasheets & hardware (to be exact none of those supported
in mainline having custom timing routine) and by simulation.
gpmc timing dependency on peripheral timings:
[<gpmc_timing>: <peripheral timing1>, <peripheral timing2> ...]
1. common
cs_on:
t_ceasu
adv_on:
t_avdasu, t_ceavd
2. sync common
sync_clk:
clk
page_burst_access:
t_bacc
clk_activation:
t_ces, t_avds
3. read async muxed
adv_rd_off:
t_avdp_r
oe_on:
t_oeasu, t_aavdh
access:
t_iaa, t_oe, t_ce, t_aa
rd_cycle:
t_rd_cycle, t_cez_r, t_oez
4. read async non-muxed
adv_rd_off:
t_avdp_r
oe_on:
t_oeasu
access:
t_iaa, t_oe, t_ce, t_aa
rd_cycle:
t_rd_cycle, t_cez_r, t_oez
5. read sync muxed
adv_rd_off:
t_avdp_r, t_avdh
oe_on:
t_oeasu, t_ach, cyc_aavdh_oe
access:
t_iaa, cyc_iaa, cyc_oe
rd_cycle:
t_cez_r, t_oez, t_ce_rdyz
6. read sync non-muxed
adv_rd_off:
t_avdp_r
oe_on:
t_oeasu
access:
t_iaa, cyc_iaa, cyc_oe
rd_cycle:
t_cez_r, t_oez, t_ce_rdyz
7. write async muxed
adv_wr_off:
t_avdp_w
we_on, wr_data_mux_bus:
t_weasu, t_aavdh, cyc_aavhd_we
we_off:
t_wpl
cs_wr_off:
t_wph
wr_cycle:
t_cez_w, t_wr_cycle
8. write async non-muxed
adv_wr_off:
t_avdp_w
we_on, wr_data_mux_bus:
t_weasu
we_off:
t_wpl
cs_wr_off:
t_wph
wr_cycle:
t_cez_w, t_wr_cycle
9. write sync muxed
adv_wr_off:
t_avdp_w, t_avdh
we_on, wr_data_mux_bus:
t_weasu, t_rdyo, t_aavdh, cyc_aavhd_we
we_off:
t_wpl, cyc_wpl
cs_wr_off:
t_wph
wr_cycle:
t_cez_w, t_ce_rdyz
10. write sync non-muxed
adv_wr_off:
t_avdp_w
we_on, wr_data_mux_bus:
t_weasu, t_rdyo
we_off:
t_wpl, cyc_wpl
cs_wr_off:
t_wph
wr_cycle:
t_cez_w, t_ce_rdyz
Note:
Many of gpmc timings are dependent on other gpmc timings (a few
gpmc timings purely dependent on other gpmc timings, a reason that
some of the gpmc timings are missing above), and it will result in
indirect dependency of peripheral timings to gpmc timings other than
mentioned above, refer timing routine for more details. To know what
these peripheral timings correspond to, please see explanations in
struct gpmc_device_timings definition. And for gpmc timings refer
IP details (link above).

View File

@@ -0,0 +1,175 @@
=================
MEN Chameleon Bus
=================
.. Table of Contents
=================
1 Introduction
1.1 Scope of this Document
1.2 Limitations of the current implementation
2 Architecture
2.1 MEN Chameleon Bus
2.2 Carrier Devices
2.3 Parser
3 Resource handling
3.1 Memory Resources
3.2 IRQs
4 Writing an MCB driver
4.1 The driver structure
4.2 Probing and attaching
4.3 Initializing the driver
Introduction
============
This document describes the architecture and implementation of the MEN
Chameleon Bus (called MCB throughout this document).
Scope of this Document
----------------------
This document is intended to be a short overview of the current
implementation and does by no means describe the complete possibilities of MCB
based devices.
Limitations of the current implementation
-----------------------------------------
The current implementation is limited to PCI and PCIe based carrier devices
that only use a single memory resource and share the PCI legacy IRQ. Not
implemented are:
- Multi-resource MCB devices like the VME Controller or M-Module carrier.
- MCB devices that need another MCB device, like SRAM for a DMA Controller's
buffer descriptors or a video controller's video memory.
- A per-carrier IRQ domain for carrier devices that have one (or more) IRQs
per MCB device like PCIe based carriers with MSI or MSI-X support.
Architecture
============
MCB is divided into 3 functional blocks:
- The MEN Chameleon Bus itself,
- drivers for MCB Carrier Devices and
- the parser for the Chameleon table.
MEN Chameleon Bus
-----------------
The MEN Chameleon Bus is an artificial bus system that attaches to a so
called Chameleon FPGA device found on some hardware produced my MEN Mikro
Elektronik GmbH. These devices are multi-function devices implemented in a
single FPGA and usually attached via some sort of PCI or PCIe link. Each
FPGA contains a header section describing the content of the FPGA. The
header lists the device id, PCI BAR, offset from the beginning of the PCI
BAR, size in the FPGA, interrupt number and some other properties currently
not handled by the MCB implementation.
Carrier Devices
---------------
A carrier device is just an abstraction for the real world physical bus the
Chameleon FPGA is attached to. Some IP Core drivers may need to interact with
properties of the carrier device (like querying the IRQ number of a PCI
device). To provide abstraction from the real hardware bus, an MCB carrier
device provides callback methods to translate the driver's MCB function calls
to hardware related function calls. For example a carrier device may
implement the get_irq() method which can be translated into a hardware bus
query for the IRQ number the device should use.
Parser
------
The parser reads the first 512 bytes of a Chameleon device and parses the
Chameleon table. Currently the parser only supports the Chameleon v2 variant
of the Chameleon table but can easily be adopted to support an older or
possible future variant. While parsing the table's entries new MCB devices
are allocated and their resources are assigned according to the resource
assignment in the Chameleon table. After resource assignment is finished, the
MCB devices are registered at the MCB and thus at the driver core of the
Linux kernel.
Resource handling
=================
The current implementation assigns exactly one memory and one IRQ resource
per MCB device. But this is likely going to change in the future.
Memory Resources
----------------
Each MCB device has exactly one memory resource, which can be requested from
the MCB bus. This memory resource is the physical address of the MCB device
inside the carrier and is intended to be passed to ioremap() and friends. It
is already requested from the kernel by calling request_mem_region().
IRQs
----
Each MCB device has exactly one IRQ resource, which can be requested from the
MCB bus. If a carrier device driver implements the ->get_irq() callback
method, the IRQ number assigned by the carrier device will be returned,
otherwise the IRQ number inside the Chameleon table will be returned. This
number is suitable to be passed to request_irq().
Writing an MCB driver
=====================
The driver structure
--------------------
Each MCB driver has a structure to identify the device driver as well as
device ids which identify the IP Core inside the FPGA. The driver structure
also contains callback methods which get executed on driver probe and
removal from the system::
static const struct mcb_device_id foo_ids[] = {
{ .device = 0x123 },
{ }
};
MODULE_DEVICE_TABLE(mcb, foo_ids);
static struct mcb_driver foo_driver = {
driver = {
.name = "foo-bar",
.owner = THIS_MODULE,
},
.probe = foo_probe,
.remove = foo_remove,
.id_table = foo_ids,
};
Probing and attaching
---------------------
When a driver is loaded and the MCB devices it services are found, the MCB
core will call the driver's probe callback method. When the driver is removed
from the system, the MCB core will call the driver's remove callback method::
static init foo_probe(struct mcb_device *mdev, const struct mcb_device_id *id);
static void foo_remove(struct mcb_device *mdev);
Initializing the driver
-----------------------
When the kernel is booted or your foo driver module is inserted, you have to
perform driver initialization. Usually it is enough to register your driver
module at the MCB core::
static int __init foo_init(void)
{
return mcb_register_driver(&foo_driver);
}
module_init(foo_init);
static void __exit foo_exit(void)
{
mcb_unregister_driver(&foo_driver);
}
module_exit(foo_exit);
The module_mcb_driver() macro can be used to reduce the above code::
module_mcb_driver(foo_driver);

View File

@@ -0,0 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0
========================
MMC/SD/SDIO card support
========================
.. toctree::
:maxdepth: 1
mmc-dev-attrs
mmc-dev-parts
mmc-async-req
mmc-tools

View File

@@ -0,0 +1,98 @@
========================
MMC Asynchronous Request
========================
Rationale
=========
How significant is the cache maintenance overhead?
It depends. Fast eMMC and multiple cache levels with speculative cache
pre-fetch makes the cache overhead relatively significant. If the DMA
preparations for the next request are done in parallel with the current
transfer, the DMA preparation overhead would not affect the MMC performance.
The intention of non-blocking (asynchronous) MMC requests is to minimize the
time between when an MMC request ends and another MMC request begins.
Using mmc_wait_for_req(), the MMC controller is idle while dma_map_sg and
dma_unmap_sg are processing. Using non-blocking MMC requests makes it
possible to prepare the caches for next job in parallel with an active
MMC request.
MMC block driver
================
The mmc_blk_issue_rw_rq() in the MMC block driver is made non-blocking.
The increase in throughput is proportional to the time it takes to
prepare (major part of preparations are dma_map_sg() and dma_unmap_sg())
a request and how fast the memory is. The faster the MMC/SD is the
more significant the prepare request time becomes. Roughly the expected
performance gain is 5% for large writes and 10% on large reads on a L2 cache
platform. In power save mode, when clocks run on a lower frequency, the DMA
preparation may cost even more. As long as these slower preparations are run
in parallel with the transfer performance won't be affected.
Details on measurements from IOZone and mmc_test
================================================
https://wiki.linaro.org/WorkingGroups/Kernel/Specs/StoragePerfMMC-async-req
MMC core API extension
======================
There is one new public function mmc_start_req().
It starts a new MMC command request for a host. The function isn't
truly non-blocking. If there is an ongoing async request it waits
for completion of that request and starts the new one and returns. It
doesn't wait for the new request to complete. If there is no ongoing
request it starts the new request and returns immediately.
MMC host extensions
===================
There are two optional members in the mmc_host_ops -- pre_req() and
post_req() -- that the host driver may implement in order to move work
to before and after the actual mmc_host_ops.request() function is called.
In the DMA case pre_req() may do dma_map_sg() and prepare the DMA
descriptor, and post_req() runs the dma_unmap_sg().
Optimize for the first request
==============================
The first request in a series of requests can't be prepared in parallel
with the previous transfer, since there is no previous request.
The argument is_first_req in pre_req() indicates that there is no previous
request. The host driver may optimize for this scenario to minimize
the performance loss. A way to optimize for this is to split the current
request in two chunks, prepare the first chunk and start the request,
and finally prepare the second chunk and start the transfer.
Pseudocode to handle is_first_req scenario with minimal prepare overhead::
if (is_first_req && req->size > threshold)
/* start MMC transfer for the complete transfer size */
mmc_start_command(MMC_CMD_TRANSFER_FULL_SIZE);
/*
* Begin to prepare DMA while cmd is being processed by MMC.
* The first chunk of the request should take the same time
* to prepare as the "MMC process command time".
* If prepare time exceeds MMC cmd time
* the transfer is delayed, guesstimate max 4k as first chunk size.
*/
prepare_1st_chunk_for_dma(req);
/* flush pending desc to the DMAC (dmaengine.h) */
dma_issue_pending(req->dma_desc);
prepare_2nd_chunk_for_dma(req);
/*
* The second issue_pending should be called before MMC runs out
* of the first chunk. If the MMC runs out of the first data chunk
* before this call, the transfer is delayed.
*/
dma_issue_pending(req->dma_desc);

View File

@@ -0,0 +1,91 @@
==================================
SD and MMC Block Device Attributes
==================================
These attributes are defined for the block devices associated with the
SD or MMC device.
The following attributes are read/write.
======== ===============================================
force_ro Enforce read-only access even if write protect switch is off.
======== ===============================================
SD and MMC Device Attributes
============================
All attributes are read-only.
====================== ===============================================
cid Card Identification Register
csd Card Specific Data Register
scr SD Card Configuration Register (SD only)
date Manufacturing Date (from CID Register)
fwrev Firmware/Product Revision (from CID Register)
(SD and MMCv1 only)
hwrev Hardware/Product Revision (from CID Register)
(SD and MMCv1 only)
manfid Manufacturer ID (from CID Register)
name Product Name (from CID Register)
oemid OEM/Application ID (from CID Register)
prv Product Revision (from CID Register)
(SD and MMCv4 only)
serial Product Serial Number (from CID Register)
erase_size Erase group size
preferred_erase_size Preferred erase size
raw_rpmb_size_mult RPMB partition size
rel_sectors Reliable write sector count
ocr Operation Conditions Register
dsr Driver Stage Register
cmdq_en Command Queue enabled:
1 => enabled, 0 => not enabled
====================== ===============================================
Note on Erase Size and Preferred Erase Size:
"erase_size" is the minimum size, in bytes, of an erase
operation. For MMC, "erase_size" is the erase group size
reported by the card. Note that "erase_size" does not apply
to trim or secure trim operations where the minimum size is
always one 512 byte sector. For SD, "erase_size" is 512
if the card is block-addressed, 0 otherwise.
SD/MMC cards can erase an arbitrarily large area up to and
including the whole card. When erasing a large area it may
be desirable to do it in smaller chunks for three reasons:
1. A single erase command will make all other I/O on
the card wait. This is not a problem if the whole card
is being erased, but erasing one partition will make
I/O for another partition on the same card wait for the
duration of the erase - which could be a several
minutes.
2. To be able to inform the user of erase progress.
3. The erase timeout becomes too large to be very
useful. Because the erase timeout contains a margin
which is multiplied by the size of the erase area,
the value can end up being several minutes for large
areas.
"erase_size" is not the most efficient unit to erase
(especially for SD where it is just one sector),
hence "preferred_erase_size" provides a good chunk
size for erasing large areas.
For MMC, "preferred_erase_size" is the high-capacity
erase size if a card specifies one, otherwise it is
based on the capacity of the card.
For SD, "preferred_erase_size" is the allocation unit
size specified by the card.
"preferred_erase_size" is in bytes.
Note on raw_rpmb_size_mult:
"raw_rpmb_size_mult" is a multiple of 128kB block.
RPMB size in byte is calculated by using the following equation:
RPMB partition size = 128kB x raw_rpmb_size_mult

View File

@@ -0,0 +1,41 @@
============================
SD and MMC Device Partitions
============================
Device partitions are additional logical block devices present on the
SD/MMC device.
As of this writing, MMC boot partitions as supported and exposed as
/dev/mmcblkXboot0 and /dev/mmcblkXboot1, where X is the index of the
parent /dev/mmcblkX.
MMC Boot Partitions
===================
Read and write access is provided to the two MMC boot partitions. Due to
the sensitive nature of the boot partition contents, which often store
a bootloader or bootloader configuration tables crucial to booting the
platform, write access is disabled by default to reduce the chance of
accidental bricking.
To enable write access to /dev/mmcblkXbootY, disable the forced read-only
access with::
echo 0 > /sys/block/mmcblkXbootY/force_ro
To re-enable read-only access::
echo 1 > /sys/block/mmcblkXbootY/force_ro
The boot partitions can also be locked read only until the next power on,
with::
echo 1 > /sys/block/mmcblkXbootY/ro_lock_until_next_power_on
This is a feature of the card and not of the kernel. If the card does
not support boot partition locking, the file will not exist. If the
feature has been disabled on the card, the file will be read-only.
The boot partitions can also be locked permanently, but this feature is
not accessible through sysfs in order to avoid accidental or malicious
bricking.

View File

@@ -0,0 +1,37 @@
======================
MMC tools introduction
======================
There is one MMC test tools called mmc-utils, which is maintained by Chris Ball,
you can find it at the below public git repository:
http://git.kernel.org/cgit/linux/kernel/git/cjb/mmc-utils.git/
Functions
=========
The mmc-utils tools can do the following:
- Print and parse extcsd data.
- Determine the eMMC writeprotect status.
- Set the eMMC writeprotect status.
- Set the eMMC data sector size to 4KB by disabling emulation.
- Create general purpose partition.
- Enable the enhanced user area.
- Enable write reliability per partition.
- Print the response to STATUS_SEND (CMD13).
- Enable the boot partition.
- Set Boot Bus Conditions.
- Enable the eMMC BKOPS feature.
- Permanently enable the eMMC H/W Reset feature.
- Permanently disable the eMMC H/W Reset feature.
- Send Sanitize command.
- Program authentication key for the device.
- Counter value for the rpmb device will be read to stdout.
- Read from rpmb device to output.
- Write to rpmb device from data file.
- Enable the eMMC cache feature.
- Disable the eMMC cache feature.
- Print and parse CID data.
- Print and parse CSD data.
- Print and parse SCR data.

View File

@@ -0,0 +1,12 @@
.. SPDX-License-Identifier: GPL-2.0
==============================
Memory Technology Device (MTD)
==============================
.. toctree::
:maxdepth: 1
intel-spi
nand_ecc
spi-nor

View File

@@ -0,0 +1,90 @@
==============================
Upgrading BIOS using intel-spi
==============================
Many Intel CPUs like Baytrail and Braswell include SPI serial flash host
controller which is used to hold BIOS and other platform specific data.
Since contents of the SPI serial flash is crucial for machine to function,
it is typically protected by different hardware protection mechanisms to
avoid accidental (or on purpose) overwrite of the content.
Not all manufacturers protect the SPI serial flash, mainly because it
allows upgrading the BIOS image directly from an OS.
The intel-spi driver makes it possible to read and write the SPI serial
flash, if certain protection bits are not set and locked. If it finds
any of them set, the whole MTD device is made read-only to prevent
partial overwrites. By default the driver exposes SPI serial flash
contents as read-only but it can be changed from kernel command line,
passing "intel-spi.writeable=1".
Please keep in mind that overwriting the BIOS image on SPI serial flash
might render the machine unbootable and requires special equipment like
Dediprog to revive. You have been warned!
Below are the steps how to upgrade MinnowBoard MAX BIOS directly from
Linux.
1) Download and extract the latest Minnowboard MAX BIOS SPI image
[1]. At the time writing this the latest image is v92.
2) Install mtd-utils package [2]. We need this in order to erase the SPI
serial flash. Distros like Debian and Fedora have this prepackaged with
name "mtd-utils".
3) Add "intel-spi.writeable=1" to the kernel command line and reboot
the board (you can also reload the driver passing "writeable=1" as
module parameter to modprobe).
4) Once the board is up and running again, find the right MTD partition
(it is named as "BIOS")::
# cat /proc/mtd
dev: size erasesize name
mtd0: 00800000 00001000 "BIOS"
So here it will be /dev/mtd0 but it may vary.
5) Make backup of the existing image first::
# dd if=/dev/mtd0ro of=bios.bak
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 10.0269 s, 837 kB/s
6) Verify the backup:
# sha1sum /dev/mtd0ro bios.bak
fdbb011920572ca6c991377c4b418a0502668b73 /dev/mtd0ro
fdbb011920572ca6c991377c4b418a0502668b73 bios.bak
The SHA1 sums must match. Otherwise do not continue any further!
7) Erase the SPI serial flash. After this step, do not reboot the
board! Otherwise it will not start anymore::
# flash_erase /dev/mtd0 0 0
Erasing 4 Kibyte @ 7ff000 -- 100 % complete
8) Once completed without errors you can write the new BIOS image:
# dd if=MNW2MAX1.X64.0092.R01.1605221712.bin of=/dev/mtd0
9) Verify that the new content of the SPI serial flash matches the new
BIOS image::
# sha1sum /dev/mtd0ro MNW2MAX1.X64.0092.R01.1605221712.bin
9b4df9e4be2057fceec3a5529ec3d950836c87a2 /dev/mtd0ro
9b4df9e4be2057fceec3a5529ec3d950836c87a2 MNW2MAX1.X64.0092.R01.1605221712.bin
The SHA1 sums should match.
10) Now you can reboot your board and observe the new BIOS starting up
properly.
References
----------
[1] https://firmware.intel.com/sites/default/files/MinnowBoard%2EMAX_%2EX64%2E92%2ER01%2Ezip
[2] http://www.linux-mtd.infradead.org/

View File

@@ -0,0 +1,763 @@
==========================
NAND Error-correction Code
==========================
Introduction
============
Having looked at the linux mtd/nand driver and more specific at nand_ecc.c
I felt there was room for optimisation. I bashed the code for a few hours
performing tricks like table lookup removing superfluous code etc.
After that the speed was increased by 35-40%.
Still I was not too happy as I felt there was additional room for improvement.
Bad! I was hooked.
I decided to annotate my steps in this file. Perhaps it is useful to someone
or someone learns something from it.
The problem
===========
NAND flash (at least SLC one) typically has sectors of 256 bytes.
However NAND flash is not extremely reliable so some error detection
(and sometimes correction) is needed.
This is done by means of a Hamming code. I'll try to explain it in
laymans terms (and apologies to all the pro's in the field in case I do
not use the right terminology, my coding theory class was almost 30
years ago, and I must admit it was not one of my favourites).
As I said before the ecc calculation is performed on sectors of 256
bytes. This is done by calculating several parity bits over the rows and
columns. The parity used is even parity which means that the parity bit = 1
if the data over which the parity is calculated is 1 and the parity bit = 0
if the data over which the parity is calculated is 0. So the total
number of bits over the data over which the parity is calculated + the
parity bit is even. (see wikipedia if you can't follow this).
Parity is often calculated by means of an exclusive or operation,
sometimes also referred to as xor. In C the operator for xor is ^
Back to ecc.
Let's give a small figure:
========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ====
byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14
byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14
byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14
byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14
byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14
...
byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15
byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15
cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0
cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2
cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4
========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ====
This figure represents a sector of 256 bytes.
cp is my abbreviation for column parity, rp for row parity.
Let's start to explain column parity.
- cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
- cp2 is the parity over bit0, bit1, bit4 and bit5
- cp3 is the parity over bit2, bit3, bit6 and bit7.
- cp4 is the parity over bit0, bit1, bit2 and bit3.
- cp5 is the parity over bit4, bit5, bit6 and bit7.
Note that each of cp0 .. cp5 is exactly one bit.
Row parity actually works almost the same.
- rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
- rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
- rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
(so handle two bytes, then skip 2 bytes).
- rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
- for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
- and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
The story now becomes quite boring. I guess you get the idea.
- rp6 covers 8 bytes then skips 8 etc
- rp7 skips 8 bytes then covers 8 etc
- rp8 covers 16 bytes then skips 16 etc
- rp9 skips 16 bytes then covers 16 etc
- rp10 covers 32 bytes then skips 32 etc
- rp11 skips 32 bytes then covers 32 etc
- rp12 covers 64 bytes then skips 64 etc
- rp13 skips 64 bytes then covers 64 etc
- rp14 covers 128 bytes then skips 128
- rp15 skips 128 bytes then covers 128
In the end the parity bits are grouped together in three bytes as
follows:
===== ===== ===== ===== ===== ===== ===== ===== =====
ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
===== ===== ===== ===== ===== ===== ===== ===== =====
ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00
ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08
ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1
===== ===== ===== ===== ===== ===== ===== ===== =====
I detected after writing this that ST application note AN1823
(http://www.st.com/stonline/) gives a much
nicer picture.(but they use line parity as term where I use row parity)
Oh well, I'm graphically challenged, so suffer with me for a moment :-)
And I could not reuse the ST picture anyway for copyright reasons.
Attempt 0
=========
Implementing the parity calculation is pretty simple.
In C pseudocode::
for (i = 0; i < 256; i++)
{
if (i & 0x01)
rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
else
rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0;
if (i & 0x02)
rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
else
rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
if (i & 0x04)
rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
else
rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
if (i & 0x08)
rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
else
rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
if (i & 0x10)
rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
else
rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
if (i & 0x20)
rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
else
rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
if (i & 0x40)
rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
else
rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
if (i & 0x80)
rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
else
rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
}
Analysis 0
==========
C does have bitwise operators but not really operators to do the above
efficiently (and most hardware has no such instructions either).
Therefore without implementing this it was clear that the code above was
not going to bring me a Nobel prize :-)
Fortunately the exclusive or operation is commutative, so we can combine
the values in any order. So instead of calculating all the bits
individually, let us try to rearrange things.
For the column parity this is easy. We can just xor the bytes and in the
end filter out the relevant bits. This is pretty nice as it will bring
all cp calculation out of the for loop.
Similarly we can first xor the bytes for the various rows.
This leads to:
Attempt 1
=========
::
const char parity[256] = {
0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
};
void ecc1(const unsigned char *buf, unsigned char *code)
{
int i;
const unsigned char *bp = buf;
unsigned char cur;
unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
unsigned char par;
par = 0;
rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
for (i = 0; i < 256; i++)
{
cur = *bp++;
par ^= cur;
if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
}
code[0] =
(parity[rp7] << 7) |
(parity[rp6] << 6) |
(parity[rp5] << 5) |
(parity[rp4] << 4) |
(parity[rp3] << 3) |
(parity[rp2] << 2) |
(parity[rp1] << 1) |
(parity[rp0]);
code[1] =
(parity[rp15] << 7) |
(parity[rp14] << 6) |
(parity[rp13] << 5) |
(parity[rp12] << 4) |
(parity[rp11] << 3) |
(parity[rp10] << 2) |
(parity[rp9] << 1) |
(parity[rp8]);
code[2] =
(parity[par & 0xf0] << 7) |
(parity[par & 0x0f] << 6) |
(parity[par & 0xcc] << 5) |
(parity[par & 0x33] << 4) |
(parity[par & 0xaa] << 3) |
(parity[par & 0x55] << 2);
code[0] = ~code[0];
code[1] = ~code[1];
code[2] = ~code[2];
}
Still pretty straightforward. The last three invert statements are there to
give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
all data is 0xff, so the checksum then matches.
I also introduced the parity lookup. I expected this to be the fastest
way to calculate the parity, but I will investigate alternatives later
on.
Analysis 1
==========
The code works, but is not terribly efficient. On my system it took
almost 4 times as much time as the linux driver code. But hey, if it was
*that* easy this would have been done long before.
No pain. no gain.
Fortunately there is plenty of room for improvement.
In step 1 we moved from bit-wise calculation to byte-wise calculation.
However in C we can also use the unsigned long data type and virtually
every modern microprocessor supports 32 bit operations, so why not try
to write our code in such a way that we process data in 32 bit chunks.
Of course this means some modification as the row parity is byte by
byte. A quick analysis:
for the column parity we use the par variable. When extending to 32 bits
we can in the end easily calculate rp0 and rp1 from it.
(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
respectively, from MSB to LSB)
also rp2 and rp3 can be easily retrieved from par as rp3 covers the
first two MSBs and rp2 covers the last two LSBs.
Note that of course now the loop is executed only 64 times (256/4).
And note that care must taken wrt byte ordering. The way bytes are
ordered in a long is machine dependent, and might affect us.
Anyway, if there is an issue: this code is developed on x86 (to be
precise: a DELL PC with a D920 Intel CPU)
And of course the performance might depend on alignment, but I expect
that the I/O buffers in the nand driver are aligned properly (and
otherwise that should be fixed to get maximum performance).
Let's give it a try...
Attempt 2
=========
::
extern const char parity[256];
void ecc2(const unsigned char *buf, unsigned char *code)
{
int i;
const unsigned long *bp = (unsigned long *)buf;
unsigned long cur;
unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
unsigned long par;
par = 0;
rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
for (i = 0; i < 64; i++)
{
cur = *bp++;
par ^= cur;
if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
}
/*
we need to adapt the code generation for the fact that rp vars are now
long; also the column parity calculation needs to be changed.
we'll bring rp4 to 15 back to single byte entities by shifting and
xoring
*/
rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
par ^= (par >> 16);
rp1 = (par >> 8); rp1 &= 0xff;
rp0 = (par & 0xff);
par ^= (par >> 8); par &= 0xff;
code[0] =
(parity[rp7] << 7) |
(parity[rp6] << 6) |
(parity[rp5] << 5) |
(parity[rp4] << 4) |
(parity[rp3] << 3) |
(parity[rp2] << 2) |
(parity[rp1] << 1) |
(parity[rp0]);
code[1] =
(parity[rp15] << 7) |
(parity[rp14] << 6) |
(parity[rp13] << 5) |
(parity[rp12] << 4) |
(parity[rp11] << 3) |
(parity[rp10] << 2) |
(parity[rp9] << 1) |
(parity[rp8]);
code[2] =
(parity[par & 0xf0] << 7) |
(parity[par & 0x0f] << 6) |
(parity[par & 0xcc] << 5) |
(parity[par & 0x33] << 4) |
(parity[par & 0xaa] << 3) |
(parity[par & 0x55] << 2);
code[0] = ~code[0];
code[1] = ~code[1];
code[2] = ~code[2];
}
The parity array is not shown any more. Note also that for these
examples I kinda deviated from my regular programming style by allowing
multiple statements on a line, not using { } in then and else blocks
with only a single statement and by using operators like ^=
Analysis 2
==========
The code (of course) works, and hurray: we are a little bit faster than
the linux driver code (about 15%). But wait, don't cheer too quickly.
There is more to be gained.
If we look at e.g. rp14 and rp15 we see that we either xor our data with
rp14 or with rp15. However we also have par which goes over all data.
This means there is no need to calculate rp14 as it can be calculated from
rp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15;
(or if desired we can avoid calculating rp15 and calculate it from
rp14). That is why some places refer to inverse parity.
Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
Effectively this means we can eliminate the else clause from the if
statements. Also we can optimise the calculation in the end a little bit
by going from long to byte first. Actually we can even avoid the table
lookups
Attempt 3
=========
Odd replaced::
if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
with::
if (i & 0x01) rp5 ^= cur;
if (i & 0x02) rp7 ^= cur;
if (i & 0x04) rp9 ^= cur;
if (i & 0x08) rp11 ^= cur;
if (i & 0x10) rp13 ^= cur;
if (i & 0x20) rp15 ^= cur;
and outside the loop added::
rp4 = par ^ rp5;
rp6 = par ^ rp7;
rp8 = par ^ rp9;
rp10 = par ^ rp11;
rp12 = par ^ rp13;
rp14 = par ^ rp15;
And after that the code takes about 30% more time, although the number of
statements is reduced. This is also reflected in the assembly code.
Analysis 3
==========
Very weird. Guess it has to do with caching or instruction parallellism
or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
observation was that this one is only 30% slower (according to time)
executing the code as my 3Ghz D920 processor.
Well, it was expected not to be easy so maybe instead move to a
different track: let's move back to the code from attempt2 and do some
loop unrolling. This will eliminate a few if statements. I'll try
different amounts of unrolling to see what works best.
Attempt 4
=========
Unrolled the loop 1, 2, 3 and 4 times.
For 4 the code starts with::
for (i = 0; i < 4; i++)
{
cur = *bp++;
par ^= cur;
rp4 ^= cur;
rp6 ^= cur;
rp8 ^= cur;
rp10 ^= cur;
if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
cur = *bp++;
par ^= cur;
rp5 ^= cur;
rp6 ^= cur;
...
Analysis 4
==========
Unrolling once gains about 15%
Unrolling twice keeps the gain at about 15%
Unrolling three times gives a gain of 30% compared to attempt 2.
Unrolling four times gives a marginal improvement compared to unrolling
three times.
I decided to proceed with a four time unrolled loop anyway. It was my gut
feeling that in the next steps I would obtain additional gain from it.
The next step was triggered by the fact that par contains the xor of all
bytes and rp4 and rp5 each contain the xor of half of the bytes.
So in effect par = rp4 ^ rp5. But as xor is commutative we can also say
that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
eliminate rp5 (or rp4, but I already foresaw another optimisation).
The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
Attempt 5
=========
Effectively so all odd digit rp assignments in the loop were removed.
This included the else clause of the if statements.
Of course after the loop we need to correct things by adding code like::
rp5 = par ^ rp4;
Also the initial assignments (rp5 = 0; etc) could be removed.
Along the line I also removed the initialisation of rp0/1/2/3.
Analysis 5
==========
Measurements showed this was a good move. The run-time roughly halved
compared with attempt 4 with 4 times unrolled, and we only require 1/3rd
of the processor time compared to the current code in the linux kernel.
However, still I thought there was more. I didn't like all the if
statements. Why not keep a running parity and only keep the last if
statement. Time for yet another version!
Attempt 6
=========
THe code within the for loop was changed to::
for (i = 0; i < 4; i++)
{
cur = *bp++; tmppar = cur; rp4 ^= cur;
cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
cur = *bp++; tmppar ^= cur; rp4 ^= cur;
cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
cur = *bp++; tmppar ^= cur; rp6 ^= cur;
cur = *bp++; tmppar ^= cur; rp4 ^= cur;
cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
cur = *bp++; tmppar ^= cur; rp8 ^= cur;
cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
cur = *bp++; tmppar ^= cur; rp6 ^= cur;
cur = *bp++; tmppar ^= cur; rp4 ^= cur;
cur = *bp++; tmppar ^= cur;
par ^= tmppar;
if ((i & 0x1) == 0) rp12 ^= tmppar;
if ((i & 0x2) == 0) rp14 ^= tmppar;
}
As you can see tmppar is used to accumulate the parity within a for
iteration. In the last 3 statements is added to par and, if needed,
to rp12 and rp14.
While making the changes I also found that I could exploit that tmppar
contains the running parity for this iteration. So instead of having:
rp4 ^= cur; rp6 ^= cur;
I removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on next
statement. A similar change was done for rp8 and rp10
Analysis 6
==========
Measuring this code again showed big gain. When executing the original
linux code 1 million times, this took about 1 second on my system.
(using time to measure the performance). After this iteration I was back
to 0.075 sec. Actually I had to decide to start measuring over 10
million iterations in order not to lose too much accuracy. This one
definitely seemed to be the jackpot!
There is a little bit more room for improvement though. There are three
places with statements::
rp4 ^= cur; rp6 ^= cur;
It seems more efficient to also maintain a variable rp4_6 in the while
loop; This eliminates 3 statements per loop. Of course after the loop we
need to correct by adding::
rp4 ^= rp4_6;
rp6 ^= rp4_6
Furthermore there are 4 sequential assignments to rp8. This can be
encoded slightly more efficiently by saving tmppar before those 4 lines
and later do rp8 = rp8 ^ tmppar ^ notrp8;
(where notrp8 is the value of rp8 before those 4 lines).
Again a use of the commutative property of xor.
Time for a new test!
Attempt 7
=========
The new code now looks like::
for (i = 0; i < 4; i++)
{
cur = *bp++; tmppar = cur; rp4 ^= cur;
cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
cur = *bp++; tmppar ^= cur; rp4 ^= cur;
cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
cur = *bp++; tmppar ^= cur; rp6 ^= cur;
cur = *bp++; tmppar ^= cur; rp4 ^= cur;
cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
notrp8 = tmppar;
cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
cur = *bp++; tmppar ^= cur; rp6 ^= cur;
cur = *bp++; tmppar ^= cur; rp4 ^= cur;
cur = *bp++; tmppar ^= cur;
rp8 = rp8 ^ tmppar ^ notrp8;
cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
cur = *bp++; tmppar ^= cur; rp6 ^= cur;
cur = *bp++; tmppar ^= cur; rp4 ^= cur;
cur = *bp++; tmppar ^= cur;
par ^= tmppar;
if ((i & 0x1) == 0) rp12 ^= tmppar;
if ((i & 0x2) == 0) rp14 ^= tmppar;
}
rp4 ^= rp4_6;
rp6 ^= rp4_6;
Not a big change, but every penny counts :-)
Analysis 7
==========
Actually this made things worse. Not very much, but I don't want to move
into the wrong direction. Maybe something to investigate later. Could
have to do with caching again.
Guess that is what there is to win within the loop. Maybe unrolling one
more time will help. I'll keep the optimisations from 7 for now.
Attempt 8
=========
Unrolled the loop one more time.
Analysis 8
==========
This makes things worse. Let's stick with attempt 6 and continue from there.
Although it seems that the code within the loop cannot be optimised
further there is still room to optimize the generation of the ecc codes.
We can simply calculate the total parity. If this is 0 then rp4 = rp5
etc. If the parity is 1, then rp4 = !rp5;
But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
in the result byte and then do something like::
code[0] |= (code[0] << 1);
Lets test this.
Attempt 9
=========
Changed the code but again this slightly degrades performance. Tried all
kind of other things, like having dedicated parity arrays to avoid the
shift after parity[rp7] << 7; No gain.
Change the lookup using the parity array by using shift operators (e.g.
replace parity[rp7] << 7 with::
rp7 ^= (rp7 << 4);
rp7 ^= (rp7 << 2);
rp7 ^= (rp7 << 1);
rp7 &= 0x80;
No gain.
The only marginal change was inverting the parity bits, so we can remove
the last three invert statements.
Ah well, pity this does not deliver more. Then again 10 million
iterations using the linux driver code takes between 13 and 13.5
seconds, whereas my code now takes about 0.73 seconds for those 10
million iterations. So basically I've improved the performance by a
factor 18 on my system. Not that bad. Of course on different hardware
you will get different results. No warranties!
But of course there is no such thing as a free lunch. The codesize almost
tripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
Correcting errors
=================
For correcting errors I again used the ST application note as a starter,
but I also peeked at the existing code.
The algorithm itself is pretty straightforward. Just xor the given and
the calculated ecc. If all bytes are 0 there is no problem. If 11 bits
are 1 we have one correctable bit error. If there is 1 bit 1, we have an
error in the given ecc code.
It proved to be fastest to do some table lookups. Performance gain
introduced by this is about a factor 2 on my system when a repair had to
be done, and 1% or so if no repair had to be done.
Code size increased from 330 bytes to 686 bytes for this function.
(gcc 4.2, -O3)
Conclusion
==========
The gain when calculating the ecc is tremendous. Om my development hardware
a speedup of a factor of 18 for ecc calculation was achieved. On a test on an
embedded system with a MIPS core a factor 7 was obtained.
On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
5 (big endian mode, gcc 4.1.2, -O3)
For correction not much gain could be obtained (as bitflips are rare). Then
again there are also much less cycles spent there.
It seems there is not much more gain possible in this, at least when
programmed in C. Of course it might be possible to squeeze something more
out of it with an assembler program, but due to pipeline behaviour etc
this is very tricky (at least for intel hw).
Author: Frans Meulenbroeks
Copyright (C) 2008 Koninklijke Philips Electronics NV.

View File

@@ -0,0 +1,66 @@
=================
SPI NOR framework
=================
Part I - Why do we need this framework?
---------------------------------------
SPI bus controllers (drivers/spi/) only deal with streams of bytes; the bus
controller operates agnostic of the specific device attached. However, some
controllers (such as Freescale's QuadSPI controller) cannot easily handle
arbitrary streams of bytes, but rather are designed specifically for SPI NOR.
In particular, Freescale's QuadSPI controller must know the NOR commands to
find the right LUT sequence. Unfortunately, the SPI subsystem has no notion of
opcodes, addresses, or data payloads; a SPI controller simply knows to send or
receive bytes (Tx and Rx). Therefore, we must define a new layering scheme under
which the controller driver is aware of the opcodes, addressing, and other
details of the SPI NOR protocol.
Part II - How does the framework work?
--------------------------------------
This framework just adds a new layer between the MTD and the SPI bus driver.
With this new layer, the SPI NOR controller driver does not depend on the
m25p80 code anymore.
Before this framework, the layer is like::
MTD
------------------------
m25p80
------------------------
SPI bus driver
------------------------
SPI NOR chip
After this framework, the layer is like:
MTD
------------------------
SPI NOR framework
------------------------
m25p80
------------------------
SPI bus driver
------------------------
SPI NOR chip
With the SPI NOR controller driver (Freescale QuadSPI), it looks like:
MTD
------------------------
SPI NOR framework
------------------------
fsl-quadSPI
------------------------
SPI NOR chip
Part III - How can drivers use the framework?
---------------------------------------------
The main API is spi_nor_scan(). Before you call the hook, a driver should
initialize the necessary fields for spi_nor{}. Please see
drivers/mtd/spi-nor/spi-nor.c for detail. Please also refer to fsl-quadspi.c
when you want to write a new driver for a SPI NOR controller.
Another API is spi_nor_restore(), this is used to restore the status of SPI
flash chip such as addressing mode. Call it whenever detach the driver from
device or reboot the system.

View File

@@ -0,0 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0
========================
Near Field Communication
========================
.. toctree::
:maxdepth: 1
nfc-hci
nfc-pn544

View File

@@ -0,0 +1,311 @@
========================
HCI backend for NFC Core
========================
- Author: Eric Lapuyade, Samuel Ortiz
- Contact: eric.lapuyade@intel.com, samuel.ortiz@intel.com
General
-------
The HCI layer implements much of the ETSI TS 102 622 V10.2.0 specification. It
enables easy writing of HCI-based NFC drivers. The HCI layer runs as an NFC Core
backend, implementing an abstract nfc device and translating NFC Core API
to HCI commands and events.
HCI
---
HCI registers as an nfc device with NFC Core. Requests coming from userspace are
routed through netlink sockets to NFC Core and then to HCI. From this point,
they are translated in a sequence of HCI commands sent to the HCI layer in the
host controller (the chip). Commands can be executed synchronously (the sending
context blocks waiting for response) or asynchronously (the response is returned
from HCI Rx context).
HCI events can also be received from the host controller. They will be handled
and a translation will be forwarded to NFC Core as needed. There are hooks to
let the HCI driver handle proprietary events or override standard behavior.
HCI uses 2 execution contexts:
- one for executing commands : nfc_hci_msg_tx_work(). Only one command
can be executing at any given moment.
- one for dispatching received events and commands : nfc_hci_msg_rx_work().
HCI Session initialization
--------------------------
The Session initialization is an HCI standard which must unfortunately
support proprietary gates. This is the reason why the driver will pass a list
of proprietary gates that must be part of the session. HCI will ensure all
those gates have pipes connected when the hci device is set up.
In case the chip supports pre-opened gates and pseudo-static pipes, the driver
can pass that information to HCI core.
HCI Gates and Pipes
-------------------
A gate defines the 'port' where some service can be found. In order to access
a service, one must create a pipe to that gate and open it. In this
implementation, pipes are totally hidden. The public API only knows gates.
This is consistent with the driver need to send commands to proprietary gates
without knowing the pipe connected to it.
Driver interface
----------------
A driver is generally written in two parts : the physical link management and
the HCI management. This makes it easier to maintain a driver for a chip that
can be connected using various phy (i2c, spi, ...)
HCI Management
--------------
A driver would normally register itself with HCI and provide the following
entry points::
struct nfc_hci_ops {
int (*open)(struct nfc_hci_dev *hdev);
void (*close)(struct nfc_hci_dev *hdev);
int (*hci_ready) (struct nfc_hci_dev *hdev);
int (*xmit) (struct nfc_hci_dev *hdev, struct sk_buff *skb);
int (*start_poll) (struct nfc_hci_dev *hdev,
u32 im_protocols, u32 tm_protocols);
int (*dep_link_up)(struct nfc_hci_dev *hdev, struct nfc_target *target,
u8 comm_mode, u8 *gb, size_t gb_len);
int (*dep_link_down)(struct nfc_hci_dev *hdev);
int (*target_from_gate) (struct nfc_hci_dev *hdev, u8 gate,
struct nfc_target *target);
int (*complete_target_discovered) (struct nfc_hci_dev *hdev, u8 gate,
struct nfc_target *target);
int (*im_transceive) (struct nfc_hci_dev *hdev,
struct nfc_target *target, struct sk_buff *skb,
data_exchange_cb_t cb, void *cb_context);
int (*tm_send)(struct nfc_hci_dev *hdev, struct sk_buff *skb);
int (*check_presence)(struct nfc_hci_dev *hdev,
struct nfc_target *target);
int (*event_received)(struct nfc_hci_dev *hdev, u8 gate, u8 event,
struct sk_buff *skb);
};
- open() and close() shall turn the hardware on and off.
- hci_ready() is an optional entry point that is called right after the hci
session has been set up. The driver can use it to do additional initialization
that must be performed using HCI commands.
- xmit() shall simply write a frame to the physical link.
- start_poll() is an optional entrypoint that shall set the hardware in polling
mode. This must be implemented only if the hardware uses proprietary gates or a
mechanism slightly different from the HCI standard.
- dep_link_up() is called after a p2p target has been detected, to finish
the p2p connection setup with hardware parameters that need to be passed back
to nfc core.
- dep_link_down() is called to bring the p2p link down.
- target_from_gate() is an optional entrypoint to return the nfc protocols
corresponding to a proprietary gate.
- complete_target_discovered() is an optional entry point to let the driver
perform additional proprietary processing necessary to auto activate the
discovered target.
- im_transceive() must be implemented by the driver if proprietary HCI commands
are required to send data to the tag. Some tag types will require custom
commands, others can be written to using the standard HCI commands. The driver
can check the tag type and either do proprietary processing, or return 1 to ask
for standard processing. The data exchange command itself must be sent
asynchronously.
- tm_send() is called to send data in the case of a p2p connection
- check_presence() is an optional entry point that will be called regularly
by the core to check that an activated tag is still in the field. If this is
not implemented, the core will not be able to push tag_lost events to the user
space
- event_received() is called to handle an event coming from the chip. Driver
can handle the event or return 1 to let HCI attempt standard processing.
On the rx path, the driver is responsible to push incoming HCP frames to HCI
using nfc_hci_recv_frame(). HCI will take care of re-aggregation and handling
This must be done from a context that can sleep.
PHY Management
--------------
The physical link (i2c, ...) management is defined by the following structure::
struct nfc_phy_ops {
int (*write)(void *dev_id, struct sk_buff *skb);
int (*enable)(void *dev_id);
void (*disable)(void *dev_id);
};
enable():
turn the phy on (power on), make it ready to transfer data
disable():
turn the phy off
write():
Send a data frame to the chip. Note that to enable higher
layers such as an llc to store the frame for re-emission, this
function must not alter the skb. It must also not return a positive
result (return 0 for success, negative for failure).
Data coming from the chip shall be sent directly to nfc_hci_recv_frame().
LLC
---
Communication between the CPU and the chip often requires some link layer
protocol. Those are isolated as modules managed by the HCI layer. There are
currently two modules : nop (raw transfert) and shdlc.
A new llc must implement the following functions::
struct nfc_llc_ops {
void *(*init) (struct nfc_hci_dev *hdev, xmit_to_drv_t xmit_to_drv,
rcv_to_hci_t rcv_to_hci, int tx_headroom,
int tx_tailroom, int *rx_headroom, int *rx_tailroom,
llc_failure_t llc_failure);
void (*deinit) (struct nfc_llc *llc);
int (*start) (struct nfc_llc *llc);
int (*stop) (struct nfc_llc *llc);
void (*rcv_from_drv) (struct nfc_llc *llc, struct sk_buff *skb);
int (*xmit_from_hci) (struct nfc_llc *llc, struct sk_buff *skb);
};
init():
allocate and init your private storage
deinit():
cleanup
start():
establish the logical connection
stop ():
terminate the logical connection
rcv_from_drv():
handle data coming from the chip, going to HCI
xmit_from_hci():
handle data sent by HCI, going to the chip
The llc must be registered with nfc before it can be used. Do that by
calling::
nfc_llc_register(const char *name, struct nfc_llc_ops *ops);
Again, note that the llc does not handle the physical link. It is thus very
easy to mix any physical link with any llc for a given chip driver.
Included Drivers
----------------
An HCI based driver for an NXP PN544, connected through I2C bus, and using
shdlc is included.
Execution Contexts
------------------
The execution contexts are the following:
- IRQ handler (IRQH):
fast, cannot sleep. sends incoming frames to HCI where they are passed to
the current llc. In case of shdlc, the frame is queued in shdlc rx queue.
- SHDLC State Machine worker (SMW)
Only when llc_shdlc is used: handles shdlc rx & tx queues.
Dispatches HCI cmd responses.
- HCI Tx Cmd worker (MSGTXWQ)
Serializes execution of HCI commands.
Completes execution in case of response timeout.
- HCI Rx worker (MSGRXWQ)
Dispatches incoming HCI commands or events.
- Syscall context from a userspace call (SYSCALL)
Any entrypoint in HCI called from NFC Core
Workflow executing an HCI command (using shdlc)
-----------------------------------------------
Executing an HCI command can easily be performed synchronously using the
following API::
int nfc_hci_send_cmd (struct nfc_hci_dev *hdev, u8 gate, u8 cmd,
const u8 *param, size_t param_len, struct sk_buff **skb)
The API must be invoked from a context that can sleep. Most of the time, this
will be the syscall context. skb will return the result that was received in
the response.
Internally, execution is asynchronous. So all this API does is to enqueue the
HCI command, setup a local wait queue on stack, and wait_event() for completion.
The wait is not interruptible because it is guaranteed that the command will
complete after some short timeout anyway.
MSGTXWQ context will then be scheduled and invoke nfc_hci_msg_tx_work().
This function will dequeue the next pending command and send its HCP fragments
to the lower layer which happens to be shdlc. It will then start a timer to be
able to complete the command with a timeout error if no response arrive.
SMW context gets scheduled and invokes nfc_shdlc_sm_work(). This function
handles shdlc framing in and out. It uses the driver xmit to send frames and
receives incoming frames in an skb queue filled from the driver IRQ handler.
SHDLC I(nformation) frames payload are HCP fragments. They are aggregated to
form complete HCI frames, which can be a response, command, or event.
HCI Responses are dispatched immediately from this context to unblock
waiting command execution. Response processing involves invoking the completion
callback that was provided by nfc_hci_msg_tx_work() when it sent the command.
The completion callback will then wake the syscall context.
It is also possible to execute the command asynchronously using this API::
static int nfc_hci_execute_cmd_async(struct nfc_hci_dev *hdev, u8 pipe, u8 cmd,
const u8 *param, size_t param_len,
data_exchange_cb_t cb, void *cb_context)
The workflow is the same, except that the API call returns immediately, and
the callback will be called with the result from the SMW context.
Workflow receiving an HCI event or command
------------------------------------------
HCI commands or events are not dispatched from SMW context. Instead, they are
queued to HCI rx_queue and will be dispatched from HCI rx worker
context (MSGRXWQ). This is done this way to allow a cmd or event handler
to also execute other commands (for example, handling the
NFC_HCI_EVT_TARGET_DISCOVERED event from PN544 requires to issue an
ANY_GET_PARAMETER to the reader A gate to get information on the target
that was discovered).
Typically, such an event will be propagated to NFC Core from MSGRXWQ context.
Error management
----------------
Errors that occur synchronously with the execution of an NFC Core request are
simply returned as the execution result of the request. These are easy.
Errors that occur asynchronously (e.g. in a background protocol handling thread)
must be reported such that upper layers don't stay ignorant that something
went wrong below and know that expected events will probably never happen.
Handling of these errors is done as follows:
- driver (pn544) fails to deliver an incoming frame: it stores the error such
that any subsequent call to the driver will result in this error. Then it
calls the standard nfc_shdlc_recv_frame() with a NULL argument to report the
problem above. shdlc stores a EREMOTEIO sticky status, which will trigger
SMW to report above in turn.
- SMW is basically a background thread to handle incoming and outgoing shdlc
frames. This thread will also check the shdlc sticky status and report to HCI
when it discovers it is not able to run anymore because of an unrecoverable
error that happened within shdlc or below. If the problem occurs during shdlc
connection, the error is reported through the connect completion.
- HCI: if an internal HCI error happens (frame is lost), or HCI is reported an
error from a lower layer, HCI will either complete the currently executing
command with that error, or notify NFC Core directly if no command is
executing.
- NFC Core: when NFC Core is notified of an error from below and polling is
active, it will send a tag discovered event with an empty tag list to the user
space to let it know that the poll operation will never be able to detect a
tag. If polling is not active and the error was sticky, lower levels will
return it at next invocation.

View File

@@ -0,0 +1,34 @@
============================================================================
Kernel driver for the NXP Semiconductors PN544 Near Field Communication chip
============================================================================
General
-------
The PN544 is an integrated transmission module for contactless
communication. The driver goes under drives/nfc/ and is compiled as a
module named "pn544".
Host Interfaces: I2C, SPI and HSU, this driver supports currently only I2C.
Protocols
---------
In the normal (HCI) mode and in the firmware update mode read and
write functions behave a bit differently because the message formats
or the protocols are different.
In the normal (HCI) mode the protocol used is derived from the ETSI
HCI specification. The firmware is updated using a specific protocol,
which is different from HCI.
HCI messages consist of an eight bit header and the message body. The
header contains the message length. Maximum size for an HCI message is
33. In HCI mode sent messages are tested for a correct
checksum. Firmware update messages have the length in the second (MSB)
and third (LSB) bytes of the message. The maximum FW message length is
1024 bytes.
For the ETSI HCI specification see
http://www.etsi.org/WebSite/Technologies/ProtocolSpecification.aspx

View File

@@ -0,0 +1,236 @@
===========
NTB Drivers
===========
NTB (Non-Transparent Bridge) is a type of PCI-Express bridge chip that connects
the separate memory systems of two or more computers to the same PCI-Express
fabric. Existing NTB hardware supports a common feature set: doorbell
registers and memory translation windows, as well as non common features like
scratchpad and message registers. Scratchpad registers are read-and-writable
registers that are accessible from either side of the device, so that peers can
exchange a small amount of information at a fixed address. Message registers can
be utilized for the same purpose. Additionally they are provided with with
special status bits to make sure the information isn't rewritten by another
peer. Doorbell registers provide a way for peers to send interrupt events.
Memory windows allow translated read and write access to the peer memory.
NTB Core Driver (ntb)
=====================
The NTB core driver defines an api wrapping the common feature set, and allows
clients interested in NTB features to discover NTB the devices supported by
hardware drivers. The term "client" is used here to mean an upper layer
component making use of the NTB api. The term "driver," or "hardware driver,"
is used here to mean a driver for a specific vendor and model of NTB hardware.
NTB Client Drivers
==================
NTB client drivers should register with the NTB core driver. After
registering, the client probe and remove functions will be called appropriately
as ntb hardware, or hardware drivers, are inserted and removed. The
registration uses the Linux Device framework, so it should feel familiar to
anyone who has written a pci driver.
NTB Typical client driver implementation
----------------------------------------
Primary purpose of NTB is to share some peace of memory between at least two
systems. So the NTB device features like Scratchpad/Message registers are
mainly used to perform the proper memory window initialization. Typically
there are two types of memory window interfaces supported by the NTB API:
inbound translation configured on the local ntb port and outbound translation
configured by the peer, on the peer ntb port. The first type is
depicted on the next figure::
Inbound translation:
Memory: Local NTB Port: Peer NTB Port: Peer MMIO:
____________
| dma-mapped |-ntb_mw_set_trans(addr) |
| memory | _v____________ | ______________
| (addr) |<======| MW xlat addr |<====| MW base addr |<== memory-mapped IO
|------------| |--------------| | |--------------|
So typical scenario of the first type memory window initialization looks:
1) allocate a memory region, 2) put translated address to NTB config,
3) somehow notify a peer device of performed initialization, 4) peer device
maps corresponding outbound memory window so to have access to the shared
memory region.
The second type of interface, that implies the shared windows being
initialized by a peer device, is depicted on the figure::
Outbound translation:
Memory: Local NTB Port: Peer NTB Port: Peer MMIO:
____________ ______________
| dma-mapped | | | MW base addr |<== memory-mapped IO
| memory | | |--------------|
| (addr) |<===================| MW xlat addr |<-ntb_peer_mw_set_trans(addr)
|------------| | |--------------|
Typical scenario of the second type interface initialization would be:
1) allocate a memory region, 2) somehow deliver a translated address to a peer
device, 3) peer puts the translated address to NTB config, 4) peer device maps
outbound memory window so to have access to the shared memory region.
As one can see the described scenarios can be combined in one portable
algorithm.
Local device:
1) Allocate memory for a shared window
2) Initialize memory window by translated address of the allocated region
(it may fail if local memory window initialization is unsupported)
3) Send the translated address and memory window index to a peer device
Peer device:
1) Initialize memory window with retrieved address of the allocated
by another device memory region (it may fail if peer memory window
initialization is unsupported)
2) Map outbound memory window
In accordance with this scenario, the NTB Memory Window API can be used as
follows:
Local device:
1) ntb_mw_count(pidx) - retrieve number of memory ranges, which can
be allocated for memory windows between local device and peer device
of port with specified index.
2) ntb_get_align(pidx, midx) - retrieve parameters restricting the
shared memory region alignment and size. Then memory can be properly
allocated.
3) Allocate physically contiguous memory region in compliance with
restrictions retrieved in 2).
4) ntb_mw_set_trans(pidx, midx) - try to set translation address of
the memory window with specified index for the defined peer device
(it may fail if local translated address setting is not supported)
5) Send translated base address (usually together with memory window
number) to the peer device using, for instance, scratchpad or message
registers.
Peer device:
1) ntb_peer_mw_set_trans(pidx, midx) - try to set received from other
device (related to pidx) translated address for specified memory
window. It may fail if retrieved address, for instance, exceeds
maximum possible address or isn't properly aligned.
2) ntb_peer_mw_get_addr(widx) - retrieve MMIO address to map the memory
window so to have an access to the shared memory.
Also it is worth to note, that method ntb_mw_count(pidx) should return the
same value as ntb_peer_mw_count() on the peer with port index - pidx.
NTB Transport Client (ntb\_transport) and NTB Netdev (ntb\_netdev)
------------------------------------------------------------------
The primary client for NTB is the Transport client, used in tandem with NTB
Netdev. These drivers function together to create a logical link to the peer,
across the ntb, to exchange packets of network data. The Transport client
establishes a logical link to the peer, and creates queue pairs to exchange
messages and data. The NTB Netdev then creates an ethernet device using a
Transport queue pair. Network data is copied between socket buffers and the
Transport queue pair buffer. The Transport client may be used for other things
besides Netdev, however no other applications have yet been written.
NTB Ping Pong Test Client (ntb\_pingpong)
-----------------------------------------
The Ping Pong test client serves as a demonstration to exercise the doorbell
and scratchpad registers of NTB hardware, and as an example simple NTB client.
Ping Pong enables the link when started, waits for the NTB link to come up, and
then proceeds to read and write the doorbell scratchpad registers of the NTB.
The peers interrupt each other using a bit mask of doorbell bits, which is
shifted by one in each round, to test the behavior of multiple doorbell bits
and interrupt vectors. The Ping Pong driver also reads the first local
scratchpad, and writes the value plus one to the first peer scratchpad, each
round before writing the peer doorbell register.
Module Parameters:
* unsafe - Some hardware has known issues with scratchpad and doorbell
registers. By default, Ping Pong will not attempt to exercise such
hardware. You may override this behavior at your own risk by setting
unsafe=1.
* delay\_ms - Specify the delay between receiving a doorbell
interrupt event and setting the peer doorbell register for the next
round.
* init\_db - Specify the doorbell bits to start new series of rounds. A new
series begins once all the doorbell bits have been shifted out of
range.
* dyndbg - It is suggested to specify dyndbg=+p when loading this module, and
then to observe debugging output on the console.
NTB Tool Test Client (ntb\_tool)
--------------------------------
The Tool test client serves for debugging, primarily, ntb hardware and drivers.
The Tool provides access through debugfs for reading, setting, and clearing the
NTB doorbell, and reading and writing scratchpads.
The Tool does not currently have any module parameters.
Debugfs Files:
* *debugfs*/ntb\_tool/*hw*/
A directory in debugfs will be created for each
NTB device probed by the tool. This directory is shortened to *hw*
below.
* *hw*/db
This file is used to read, set, and clear the local doorbell. Not
all operations may be supported by all hardware. To read the doorbell,
read the file. To set the doorbell, write `s` followed by the bits to
set (eg: `echo 's 0x0101' > db`). To clear the doorbell, write `c`
followed by the bits to clear.
* *hw*/mask
This file is used to read, set, and clear the local doorbell mask.
See *db* for details.
* *hw*/peer\_db
This file is used to read, set, and clear the peer doorbell.
See *db* for details.
* *hw*/peer\_mask
This file is used to read, set, and clear the peer doorbell
mask. See *db* for details.
* *hw*/spad
This file is used to read and write local scratchpads. To read
the values of all scratchpads, read the file. To write values, write a
series of pairs of scratchpad number and value
(eg: `echo '4 0x123 7 0xabc' > spad`
# to set scratchpads `4` and `7` to `0x123` and `0xabc`, respectively).
* *hw*/peer\_spad
This file is used to read and write peer scratchpads. See
*spad* for details.
NTB Hardware Drivers
====================
NTB hardware drivers should register devices with the NTB core driver. After
registering, clients probe and remove functions will be called.
NTB Intel Hardware Driver (ntb\_hw\_intel)
------------------------------------------
The Intel hardware driver supports NTB on Xeon and Atom CPUs.
Module Parameters:
* b2b\_mw\_idx
If the peer ntb is to be accessed via a memory window, then use
this memory window to access the peer ntb. A value of zero or positive
starts from the first mw idx, and a negative value starts from the last
mw idx. Both sides MUST set the same value here! The default value is
`-1`.
* b2b\_mw\_share
If the peer ntb is to be accessed via a memory window, and if
the memory window is large enough, still allow the client to use the
second half of the memory window for address translation to the peer.
* xeon\_b2b\_usd\_bar2\_addr64
If using B2B topology on Xeon hardware, use
this 64 bit address on the bus between the NTB devices for the window
at BAR2, on the upstream side of the link.
* xeon\_b2b\_usd\_bar4\_addr64 - See *xeon\_b2b\_bar2\_addr64*.
* xeon\_b2b\_usd\_bar4\_addr32 - See *xeon\_b2b\_bar2\_addr64*.
* xeon\_b2b\_usd\_bar5\_addr32 - See *xeon\_b2b\_bar2\_addr64*.
* xeon\_b2b\_dsd\_bar2\_addr64 - See *xeon\_b2b\_bar2\_addr64*.
* xeon\_b2b\_dsd\_bar4\_addr64 - See *xeon\_b2b\_bar2\_addr64*.
* xeon\_b2b\_dsd\_bar4\_addr32 - See *xeon\_b2b\_bar2\_addr64*.
* xeon\_b2b\_dsd\_bar5\_addr32 - See *xeon\_b2b\_bar2\_addr64*.

View File

@@ -0,0 +1,285 @@
=============================
BTT - Block Translation Table
=============================
1. Introduction
===============
Persistent memory based storage is able to perform IO at byte (or more
accurately, cache line) granularity. However, we often want to expose such
storage as traditional block devices. The block drivers for persistent memory
will do exactly this. However, they do not provide any atomicity guarantees.
Traditional SSDs typically provide protection against torn sectors in hardware,
using stored energy in capacitors to complete in-flight block writes, or perhaps
in firmware. We don't have this luxury with persistent memory - if a write is in
progress, and we experience a power failure, the block will contain a mix of old
and new data. Applications may not be prepared to handle such a scenario.
The Block Translation Table (BTT) provides atomic sector update semantics for
persistent memory devices, so that applications that rely on sector writes not
being torn can continue to do so. The BTT manifests itself as a stacked block
device, and reserves a portion of the underlying storage for its metadata. At
the heart of it, is an indirection table that re-maps all the blocks on the
volume. It can be thought of as an extremely simple file system that only
provides atomic sector updates.
2. Static Layout
================
The underlying storage on which a BTT can be laid out is not limited in any way.
The BTT, however, splits the available space into chunks of up to 512 GiB,
called "Arenas".
Each arena follows the same layout for its metadata, and all references in an
arena are internal to it (with the exception of one field that points to the
next arena). The following depicts the "On-disk" metadata layout::
Backing Store +-------> Arena
+---------------+ | +------------------+
| | | | Arena info block |
| Arena 0 +---+ | 4K |
| 512G | +------------------+
| | | |
+---------------+ | |
| | | |
| Arena 1 | | Data Blocks |
| 512G | | |
| | | |
+---------------+ | |
| . | | |
| . | | |
| . | | |
| | | |
| | | |
+---------------+ +------------------+
| |
| BTT Map |
| |
| |
+------------------+
| |
| BTT Flog |
| |
+------------------+
| Info block copy |
| 4K |
+------------------+
3. Theory of Operation
======================
a. The BTT Map
--------------
The map is a simple lookup/indirection table that maps an LBA to an internal
block. Each map entry is 32 bits. The two most significant bits are special
flags, and the remaining form the internal block number.
======== =============================================================
Bit Description
======== =============================================================
31 - 30 Error and Zero flags - Used in the following way::
== == ====================================================
31 30 Description
== == ====================================================
0 0 Initial state. Reads return zeroes; Premap = Postmap
0 1 Zero state: Reads return zeroes
1 0 Error state: Reads fail; Writes clear 'E' bit
1 1 Normal Block has valid postmap
== == ====================================================
29 - 0 Mappings to internal 'postmap' blocks
======== =============================================================
Some of the terminology that will be subsequently used:
============ ================================================================
External LBA LBA as made visible to upper layers.
ABA Arena Block Address - Block offset/number within an arena
Premap ABA The block offset into an arena, which was decided upon by range
checking the External LBA
Postmap ABA The block number in the "Data Blocks" area obtained after
indirection from the map
nfree The number of free blocks that are maintained at any given time.
This is the number of concurrent writes that can happen to the
arena.
============ ================================================================
For example, after adding a BTT, we surface a disk of 1024G. We get a read for
the external LBA at 768G. This falls into the second arena, and of the 512G
worth of blocks that this arena contributes, this block is at 256G. Thus, the
premap ABA is 256G. We now refer to the map, and find out the mapping for block
'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
b. The BTT Flog
---------------
The BTT provides sector atomicity by making every write an "allocating write",
i.e. Every write goes to a "free" block. A running list of free blocks is
maintained in the form of the BTT flog. 'Flog' is a combination of the words
"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
======== =====================================================================
lba The premap ABA that is being written to
old_map The old postmap ABA - after 'this' write completes, this will be a
free block.
new_map The new postmap ABA. The map will up updated to reflect this
lba->postmap_aba mapping, but we log it here in case we have to
recover.
seq Sequence number to mark which of the 2 sections of this flog entry is
valid/newest. It cycles between 01->10->11->01 (binary) under normal
operation, with 00 indicating an uninitialized state.
lba' alternate lba entry
old_map' alternate old postmap entry
new_map' alternate new postmap entry
seq' alternate sequence number.
======== =====================================================================
Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
done such that for any entry being written, it:
a. overwrites the 'old' section in the entry based on sequence numbers
b. writes the 'new' section such that the sequence number is written last.
c. The concept of lanes
-----------------------
While 'nfree' describes the number of concurrent IOs an arena can process
concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
process::
nlanes = min(nfree, num_cpus)
A lane number is obtained at the start of any IO, and is used for indexing into
all the on-disk and in-memory data structures for the duration of the IO. If
there are more CPUs than the max number of available lanes, than lanes are
protected by spinlocks.
d. In-memory data structure: Read Tracking Table (RTT)
------------------------------------------------------
Consider a case where we have two threads, one doing reads and the other,
writes. We can hit a condition where the writer thread grabs a free block to do
a new IO, but the (slow) reader thread is still reading from it. In other words,
the reader consulted a map entry, and started reading the corresponding block. A
writer started writing to the same external LBA, and finished the write updating
the map for that external LBA to point to its new postmap ABA. At this point the
internal, postmap block that the reader is (still) reading has been inserted
into the list of free blocks. If another write comes in for the same LBA, it can
grab this free block, and start writing to it, causing the reader to read
incorrect data. To prevent this, we introduce the RTT.
The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
into rtt[lane_number], the postmap ABA it is reading, and clears it after the
read is complete. Every writer thread, after grabbing a free block, checks the
RTT for its presence. If the postmap free block is in the RTT, it waits till the
reader clears the RTT entry, and only then starts writing to it.
e. In-memory data structure: map locks
--------------------------------------
Consider a case where two writer threads are writing to the same LBA. There can
be a race in the following sequence of steps::
free[lane] = map[premap_aba]
map[premap_aba] = postmap_aba
Both threads can update their respective free[lane] with the same old, freed
postmap_aba. This has made the layout inconsistent by losing a free entry, and
at the same time, duplicating another free entry for two lanes.
To solve this, we could have a single map lock (per arena) that has to be taken
before performing the above sequence, but we feel that could be too contentious.
Instead we use an array of (nfree) map_locks that is indexed by
(premap_aba modulo nfree).
f. Reconstruction from the Flog
-------------------------------
On startup, we analyze the BTT flog to create our list of free blocks. We walk
through all the entries, and for each lane, of the set of two possible
'sections', we always look at the most recent one only (based on the sequence
number). The reconstruction rules/steps are simple:
- Read map[log_entry.lba].
- If log_entry.new matches the map entry, then log_entry.old is free.
- If log_entry.new does not match the map entry, then log_entry.new is free.
(This case can only be caused by power-fails/unsafe shutdowns)
g. Summarizing - Read and Write flows
-------------------------------------
Read:
1. Convert external LBA to arena number + pre-map ABA
2. Get a lane (and take lane_lock)
3. Read map to get the entry for this pre-map ABA
4. Enter post-map ABA into RTT[lane]
5. If TRIM flag set in map, return zeroes, and end IO (go to step 8)
6. If ERROR flag set in map, end IO with EIO (go to step 8)
7. Read data from this block
8. Remove post-map ABA entry from RTT[lane]
9. Release lane (and lane_lock)
Write:
1. Convert external LBA to Arena number + pre-map ABA
2. Get a lane (and take lane_lock)
3. Use lane to index into in-memory free list and obtain a new block, next flog
index, next sequence number
4. Scan the RTT to check if free block is present, and spin/wait if it is.
5. Write data to this free block
6. Read map to get the existing post-map ABA entry for this pre-map ABA
7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
8. Write new post-map ABA into map.
9. Write old post-map entry into the free list
10. Calculate next sequence number and write into the free list entry
11. Release lane (and lane_lock)
4. Error Handling
=================
An arena would be in an error state if any of the metadata is corrupted
irrecoverably, either due to a bug or a media error. The following conditions
indicate an error:
- Info block checksum does not match (and recovering from the copy also fails)
- All internal available blocks are not uniquely and entirely addressed by the
sum of mapped blocks and free blocks (from the BTT flog).
- Rebuilding free list from the flog reveals missing/duplicate/impossible
entries
- A map entry is out of bounds
If any of these error conditions are encountered, the arena is put into a read
only state using a flag in the info block.
5. Usage
========
The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
(pmem, or blk mode). The easiest way to set up such a namespace is using the
'ndctl' utility [1]:
For example, the ndctl command line to setup a btt with a 4k sector size is::
ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
See ndctl create-namespace --help for more options.
[1]: https://github.com/pmem/ndctl

View File

@@ -0,0 +1,12 @@
.. SPDX-License-Identifier: GPL-2.0
===================================
Non-Volatile Memory Device (NVDIMM)
===================================
.. toctree::
:maxdepth: 1
nvdimm
btt
security

View File

@@ -0,0 +1,887 @@
===============================
LIBNVDIMM: Non-Volatile Devices
===============================
libnvdimm - kernel / libndctl - userspace helper library
linux-nvdimm@lists.01.org
Version 13
.. contents:
Glossary
Overview
Supporting Documents
Git Trees
LIBNVDIMM PMEM and BLK
Why BLK?
PMEM vs BLK
BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
Example NVDIMM Platform
LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
LIBNDCTL: Context
libndctl: instantiate a new library context example
LIBNVDIMM/LIBNDCTL: Bus
libnvdimm: control class device in /sys/class
libnvdimm: bus
libndctl: bus enumeration example
LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
libnvdimm: DIMM (NMEM)
libndctl: DIMM enumeration example
LIBNVDIMM/LIBNDCTL: Region
libnvdimm: region
libndctl: region enumeration example
Why Not Encode the Region Type into the Region Name?
How Do I Determine the Major Type of a Region?
LIBNVDIMM/LIBNDCTL: Namespace
libnvdimm: namespace
libndctl: namespace enumeration example
libndctl: namespace creation example
Why the Term "namespace"?
LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
libnvdimm: btt layout
libndctl: btt creation example
Summary LIBNDCTL Diagram
Glossary
========
PMEM:
A system-physical-address range where writes are persistent. A
block device composed of PMEM is capable of DAX. A PMEM address range
may span an interleave of several DIMMs.
BLK:
A set of one or more programmable memory mapped apertures provided
by a DIMM to access its media. This indirection precludes the
performance benefit of interleaving, but enables DIMM-bounded failure
modes.
DPA:
DIMM Physical Address, is a DIMM-relative offset. With one DIMM in
the system there would be a 1:1 system-physical-address:DPA association.
Once more DIMMs are added a memory controller interleave must be
decoded to determine the DPA associated with a given
system-physical-address. BLK capacity always has a 1:1 relationship
with a single-DIMM's DPA range.
DAX:
File system extensions to bypass the page cache and block layer to
mmap persistent memory, from a PMEM block device, directly into a
process address space.
DSM:
Device Specific Method: ACPI method to to control specific
device - in this case the firmware.
DCR:
NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
It defines a vendor-id, device-id, and interface format for a given DIMM.
BTT:
Block Translation Table: Persistent memory is byte addressable.
Existing software may have an expectation that the power-fail-atomicity
of writes is at least one sector, 512 bytes. The BTT is an indirection
table with atomic update semantics to front a PMEM/BLK block device
driver and present arbitrary atomic sector sizes.
LABEL:
Metadata stored on a DIMM device that partitions and identifies
(persistently names) storage between PMEM and BLK. It also partitions
BLK storage to host BTTs with different parameters per BLK-partition.
Note that traditional partition tables, GPT/MBR, are layered on top of a
BLK or PMEM device.
Overview
========
The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
and BLK mode access. These three modes of operation are described by
the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM
implementation is generic and supports pre-NFIT platforms, it was guided
by the superset of capabilities need to support this ACPI 6 definition
for NVDIMM resources. The bulk of the kernel implementation is in place
to handle the case where DPA accessible via PMEM is aliased with DPA
accessible via BLK. When that occurs a LABEL is needed to reserve DPA
for exclusive access via one mode a time.
Supporting Documents
--------------------
ACPI 6:
http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace:
http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example:
http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide:
http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
Git Trees
---------
LIBNVDIMM:
https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
LIBNDCTL:
https://github.com/pmem/ndctl.git
PMEM:
https://github.com/01org/prd
LIBNVDIMM PMEM and BLK
======================
Prior to the arrival of the NFIT, non-volatile memory was described to a
system in various ad-hoc ways. Usually only the bare minimum was
provided, namely, a single system-physical-address range where writes
are expected to be durable after a system power loss. Now, the NFIT
specification standardizes not only the description of PMEM, but also
BLK and platform message-passing entry points for control and
configuration.
For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
device driver:
1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This
range is contiguous in system memory and may be interleaved (hardware
memory controller striped) across multiple DIMMs. When interleaved the
platform may optionally provide details of which DIMMs are participating
in the interleave.
Note that while LIBNVDIMM describes system-physical-address ranges that may
alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
distinction. The different device-types are an implementation detail
that userspace can exploit to implement policies like "only interface
with address ranges from certain DIMMs". It is worth noting that when
aliasing is present and a DIMM lacks a label, then no block device can
be created by default as userspace needs to do at least one allocation
of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once
registered, can be immediately attached to nd_pmem.
2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
defined apertures. A set of apertures will access just one DIMM.
Multiple windows (apertures) allow multiple concurrent accesses, much like
tagged-command-queuing, and would likely be used by different threads or
different CPUs.
The NFIT specification defines a standard format for a BLK-aperture, but
the spec also allows for vendor specific layouts, and non-NFIT BLK
implementations may have other designs for BLK I/O. For this reason
"nd_blk" calls back into platform-specific code to perform the I/O.
One such implementation is defined in the "Driver Writer's Guide" and "DSM
Interface Example".
Why BLK?
========
While PMEM provides direct byte-addressable CPU-load/store access to
NVDIMM storage, it does not provide the best system RAS (recovery,
availability, and serviceability) model. An access to a corrupted
system-physical-address address causes a CPU exception while an access
to a corrupted address through an BLK-aperture causes that block window
to raise an error status in a register. The latter is more aligned with
the standard error model that host-bus-adapter attached disks present.
Also, if an administrator ever wants to replace a memory it is easier to
service a system at DIMM module boundaries. Compare this to PMEM where
data could be interleaved in an opaque hardware specific manner across
several DIMMs.
PMEM vs BLK
-----------
BLK-apertures solve these RAS problems, but their presence is also the
major contributing factor to the complexity of the ND subsystem. They
complicate the implementation because PMEM and BLK alias in DPA space.
Any given DIMM's DPA-range may contribute to one or more
system-physical-address sets of interleaved DIMMs, *and* may also be
accessed in its entirety through its BLK-aperture. Accessing a DPA
through a system-physical-address while simultaneously accessing the
same DPA through a BLK-aperture has undefined results. For this reason,
DIMMs with this dual interface configuration include a DSM function to
store/retrieve a LABEL. The LABEL effectively partitions the DPA-space
into exclusive system-physical-address and BLK-aperture accessible
regions. For simplicity a DIMM is allowed a PMEM "region" per each
interleave set in which it is a member. The remaining DPA space can be
carved into an arbitrary number of BLK devices with discontiguous
extents.
BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
One of the few
reasons to allow multiple BLK namespaces per REGION is so that each
BLK-namespace can be configured with a BTT with unique atomic sector
sizes. While a PMEM device can host a BTT the LABEL specification does
not provide for a sector size to be specified for a PMEM namespace.
This is due to the expectation that the primary usage model for PMEM is
via DAX, and the BTT is incompatible with DAX. However, for the cases
where an application or filesystem still needs atomic sector update
guarantees it can register a BTT on a PMEM device or partition. See
LIBNVDIMM/NDCTL: Block Translation Table "btt"
Example NVDIMM Platform
=======================
For the remainder of this document the following diagram will be
referenced for any example sysfs layouts::
(a) (b) DIMM BLK-REGION
+-------------------+--------+--------+--------+
+------+ | pm0.0 | blk2.0 | pm1.0 | blk2.1 | 0 region2
| imc0 +--+- - - region0- - - +--------+ +--------+
+--+---+ | pm0.0 | blk3.0 | pm1.0 | blk3.1 | 1 region3
| +-------------------+--------v v--------+
+--+---+ | |
| cpu0 | region1
+--+---+ | |
| +----------------------------^ ^--------+
+--+---+ | blk4.0 | pm1.0 | blk4.0 | 2 region4
| imc1 +--+----------------------------| +--------+
+------+ | blk5.0 | pm1.0 | blk5.0 | 3 region5
+----------------------------+--------+--------+
In this platform we have four DIMMs and two memory controllers in one
socket. Each unique interface (BLK or PMEM) to DPA space is identified
by a region device with a dynamically assigned id (REGION0 - REGION5).
1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
single PMEM namespace is created in the REGION0-SPA-range that spans most
of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
interleaved system-physical-address range is reclaimed as BLK-aperture
accessed space starting at DPA-offset (a) into each DIMM. In that
reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
REGION3 where "blk2.0" and "blk3.0" are just human readable names that
could be set to any user-desired name in the LABEL.
2. In the last portion of DIMM0 and DIMM1 we have an interleaved
system-physical-address range, REGION1, that spans those two DIMMs as
well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
"blk5.0".
3. The portion of DIMM2 and DIMM3 that do not participate in the REGION1
interleaved system-physical-address range (i.e. the DPA address past
offset (b) are also included in the "blk4.0" and "blk5.0" namespaces.
Note, that this example shows that BLK-aperture namespaces don't need to
be contiguous in DPA-space.
This bus is provided by the kernel under the device
/sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
the nfit_test.ko module is loaded. This not only test LIBNVDIMM but the
acpi_nfit.ko driver as well.
LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
========================================================
What follows is a description of the LIBNVDIMM sysfs layout and a
corresponding object hierarchy diagram as viewed through the LIBNDCTL
API. The example sysfs paths and diagrams are relative to the Example
NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
test.
LIBNDCTL: Context
-----------------
Every API call in the LIBNDCTL library requires a context that holds the
logging parameters and other library instance state. The library is
based on the libabc template:
https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git
LIBNDCTL: instantiate a new library context example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::
struct ndctl_ctx *ctx;
if (ndctl_new(&ctx) == 0)
return ctx;
else
return NULL;
LIBNVDIMM/LIBNDCTL: Bus
-----------------------
A bus has a 1:1 relationship with an NFIT. The current expectation for
ACPI based systems is that there is only ever one platform-global NFIT.
That said, it is trivial to register multiple NFITs, the specification
does not preclude it. The infrastructure supports multiple busses and
we use this capability to test multiple NFIT configurations in the unit
test.
LIBNVDIMM: control class device in /sys/class
---------------------------------------------
This character device accepts DSM messages to be passed to DIMM
identified by its NFIT handle::
/sys/class/nd/ndctl0
|-- dev
|-- device -> ../../../ndbus0
|-- subsystem -> ../../../../../../../class/nd
LIBNVDIMM: bus
--------------
::
struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
struct nvdimm_bus_descriptor *nfit_desc);
::
/sys/devices/platform/nfit_test.0/ndbus0
|-- commands
|-- nd
|-- nfit
|-- nmem0
|-- nmem1
|-- nmem2
|-- nmem3
|-- power
|-- provider
|-- region0
|-- region1
|-- region2
|-- region3
|-- region4
|-- region5
|-- uevent
`-- wait_probe
LIBNDCTL: bus enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Find the bus handle that describes the bus from Example NVDIMM Platform::
static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
const char *provider)
{
struct ndctl_bus *bus;
ndctl_bus_foreach(ctx, bus)
if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
return bus;
return NULL;
}
bus = get_bus_by_provider(ctx, "nfit_test.0");
LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
-------------------------------
The DIMM device provides a character device for sending commands to
hardware, and it is a container for LABELs. If the DIMM is defined by
NFIT then an optional 'nfit' attribute sub-directory is available to add
NFIT-specifics.
Note that the kernel device name for "DIMMs" is "nmemX". The NFIT
describes these devices via "Memory Device to System Physical Address
Range Mapping Structure", and there is no requirement that they actually
be physical DIMMs, so we use a more generic name.
LIBNVDIMM: DIMM (NMEM)
^^^^^^^^^^^^^^^^^^^^^^
::
struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
const struct attribute_group **groups, unsigned long flags,
unsigned long *dsm_mask);
::
/sys/devices/platform/nfit_test.0/ndbus0
|-- nmem0
| |-- available_slots
| |-- commands
| |-- dev
| |-- devtype
| |-- driver -> ../../../../../bus/nd/drivers/nvdimm
| |-- modalias
| |-- nfit
| | |-- device
| | |-- format
| | |-- handle
| | |-- phys_id
| | |-- rev_id
| | |-- serial
| | `-- vendor
| |-- state
| |-- subsystem -> ../../../../../bus/nd
| `-- uevent
|-- nmem1
[..]
LIBNDCTL: DIMM enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note, in this example we are assuming NFIT-defined DIMMs which are
identified by an "nfit_handle" a 32-bit value where:
- Bit 3:0 DIMM number within the memory channel
- Bit 7:4 memory channel number
- Bit 11:8 memory controller ID
- Bit 15:12 socket ID (within scope of a Node controller if node
controller is present)
- Bit 27:16 Node Controller ID
- Bit 31:28 Reserved
::
static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
unsigned int handle)
{
struct ndctl_dimm *dimm;
ndctl_dimm_foreach(bus, dimm)
if (ndctl_dimm_get_handle(dimm) == handle)
return dimm;
return NULL;
}
#define DIMM_HANDLE(n, s, i, c, d) \
(((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \
| ((c & 0xf) << 4) | (d & 0xf))
dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
LIBNVDIMM/LIBNDCTL: Region
--------------------------
A generic REGION device is registered for each PMEM range or BLK-aperture
set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
sets on the "nfit_test.0" bus. The primary role of regions are to be a
container of "mappings". A mapping is a tuple of <DIMM,
DPA-start-offset, length>.
LIBNVDIMM provides a built-in driver for these REGION devices. This driver
is responsible for reconciling the aliased DPA mappings across all
regions, parsing the LABEL, if present, and then emitting NAMESPACE
devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
nd_blk device driver to consume.
In addition to the generic attributes of "mapping"s, "interleave_ways"
and "size" the REGION device also exports some convenience attributes.
"nstype" indicates the integer type of namespace-device this region
emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
'add' event, "modalias" duplicates the MODALIAS variable stored by udev
at the 'add' event, and finally, the optional "spa_index" is provided in
the case where the region is defined by a SPA.
LIBNVDIMM: region::
struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
struct nd_region_desc *ndr_desc);
struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
struct nd_region_desc *ndr_desc);
::
/sys/devices/platform/nfit_test.0/ndbus0
|-- region0
| |-- available_size
| |-- btt0
| |-- btt_seed
| |-- devtype
| |-- driver -> ../../../../../bus/nd/drivers/nd_region
| |-- init_namespaces
| |-- mapping0
| |-- mapping1
| |-- mappings
| |-- modalias
| |-- namespace0.0
| |-- namespace_seed
| |-- numa_node
| |-- nfit
| | `-- spa_index
| |-- nstype
| |-- set_cookie
| |-- size
| |-- subsystem -> ../../../../../bus/nd
| `-- uevent
|-- region1
[..]
LIBNDCTL: region enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sample region retrieval routines based on NFIT-unique data like
"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
BLK::
static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
unsigned int spa_index)
{
struct ndctl_region *region;
ndctl_region_foreach(bus, region) {
if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
continue;
if (ndctl_region_get_spa_index(region) == spa_index)
return region;
}
return NULL;
}
static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
unsigned int handle)
{
struct ndctl_region *region;
ndctl_region_foreach(bus, region) {
struct ndctl_mapping *map;
if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
continue;
ndctl_mapping_foreach(region, map) {
struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
if (ndctl_dimm_get_handle(dimm) == handle)
return region;
}
}
return NULL;
}
Why Not Encode the Region Type into the Region Name?
----------------------------------------------------
At first glance it seems since NFIT defines just PMEM and BLK interface
types that we should simply name REGION devices with something derived
from those type names. However, the ND subsystem explicitly keeps the
REGION name generic and expects userspace to always consider the
region-attributes for four reasons:
1. There are already more than two REGION and "namespace" types. For
PMEM there are two subtypes. As mentioned previously we have PMEM where
the constituent DIMM devices are known and anonymous PMEM. For BLK
regions the NFIT specification already anticipates vendor specific
implementations. The exact distinction of what a region contains is in
the region-attributes not the region-name or the region-devtype.
2. A region with zero child-namespaces is a possible configuration. For
example, the NFIT allows for a DCR to be published without a
corresponding BLK-aperture. This equates to a DIMM that can only accept
control/configuration messages, but no i/o through a descendant block
device. Again, this "type" is advertised in the attributes ('mappings'
== 0) and the name does not tell you much.
3. What if a third major interface type arises in the future? Outside
of vendor specific implementations, it's not difficult to envision a
third class of interface type beyond BLK and PMEM. With a generic name
for the REGION level of the device-hierarchy old userspace
implementations can still make sense of new kernel advertised
region-types. Userspace can always rely on the generic region
attributes like "mappings", "size", etc and the expected child devices
named "namespace". This generic format of the device-model hierarchy
allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
future-proof.
4. There are more robust mechanisms for determining the major type of a
region than a device name. See the next section, How Do I Determine the
Major Type of a Region?
How Do I Determine the Major Type of a Region?
----------------------------------------------
Outside of the blanket recommendation of "use libndctl", or simply
looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
"nstype" integer attribute, here are some other options.
1. module alias lookup
^^^^^^^^^^^^^^^^^^^^^^
The whole point of region/namespace device type differentiation is to
decide which block-device driver will attach to a given LIBNVDIMM namespace.
One can simply use the modalias to lookup the resulting module. It's
important to note that this method is robust in the presence of a
vendor-specific driver down the road. If a vendor-specific
implementation wants to supplant the standard nd_blk driver it can with
minimal impact to the rest of LIBNVDIMM.
In fact, a vendor may also want to have a vendor-specific region-driver
(outside of nd_region). For example, if a vendor defined its own LABEL
format it would need its own region driver to parse that LABEL and emit
the resulting namespaces. The output from module resolution is more
accurate than a region-name or region-devtype.
2. udev
^^^^^^^
The kernel "devtype" is registered in the udev database::
# udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
P: /devices/platform/nfit_test.0/ndbus0/region0
E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
E: DEVTYPE=nd_pmem
E: MODALIAS=nd:t2
E: SUBSYSTEM=nd
# udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
P: /devices/platform/nfit_test.0/ndbus0/region4
E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
E: DEVTYPE=nd_blk
E: MODALIAS=nd:t3
E: SUBSYSTEM=nd
...and is available as a region attribute, but keep in mind that the
"devtype" does not indicate sub-type variations and scripts should
really be understanding the other attributes.
3. type specific attributes
^^^^^^^^^^^^^^^^^^^^^^^^^^^
As it currently stands a BLK-aperture region will never have a
"nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A
BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
that does not allow I/O. A PMEM region with a "mappings" value of zero
is a simple system-physical-address range.
LIBNVDIMM/LIBNDCTL: Namespace
-----------------------------
A REGION, after resolving DPA aliasing and LABEL specified boundaries,
surfaces one or more "namespace" devices. The arrival of a "namespace"
device currently triggers either the nd_blk or nd_pmem driver to load
and register a disk/block device.
LIBNVDIMM: namespace
^^^^^^^^^^^^^^^^^^^^
Here is a sample layout from the three major types of NAMESPACE where
namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
attribute), namespace2.0 represents a BLK namespace (note it has a
'sector_size' attribute) that, and namespace6.0 represents an anonymous
PMEM namespace (note that has no 'uuid' attribute due to not support a
LABEL)::
/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
|-- alt_name
|-- devtype
|-- dpa_extents
|-- force_raw
|-- modalias
|-- numa_node
|-- resource
|-- size
|-- subsystem -> ../../../../../../bus/nd
|-- type
|-- uevent
`-- uuid
/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
|-- alt_name
|-- devtype
|-- dpa_extents
|-- force_raw
|-- modalias
|-- numa_node
|-- sector_size
|-- size
|-- subsystem -> ../../../../../../bus/nd
|-- type
|-- uevent
`-- uuid
/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
|-- block
| `-- pmem0
|-- devtype
|-- driver -> ../../../../../../bus/nd/drivers/pmem
|-- force_raw
|-- modalias
|-- numa_node
|-- resource
|-- size
|-- subsystem -> ../../../../../../bus/nd
|-- type
`-- uevent
LIBNDCTL: namespace enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Namespaces are indexed relative to their parent region, example below.
These indexes are mostly static from boot to boot, but subsystem makes
no guarantees in this regard. For a static namespace identifier use its
'uuid' attribute.
::
static struct ndctl_namespace
*get_namespace_by_id(struct ndctl_region *region, unsigned int id)
{
struct ndctl_namespace *ndns;
ndctl_namespace_foreach(region, ndns)
if (ndctl_namespace_get_id(ndns) == id)
return ndns;
return NULL;
}
LIBNDCTL: namespace creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Idle namespaces are automatically created by the kernel if a given
region has enough available capacity to create a new namespace.
Namespace instantiation involves finding an idle namespace and
configuring it. For the most part the setting of namespace attributes
can occur in any order, the only constraint is that 'uuid' must be set
before 'size'. This enables the kernel to track DPA allocations
internally with a static identifier::
static int configure_namespace(struct ndctl_region *region,
struct ndctl_namespace *ndns,
struct namespace_parameters *parameters)
{
char devname[50];
snprintf(devname, sizeof(devname), "namespace%d.%d",
ndctl_region_get_id(region), paramaters->id);
ndctl_namespace_set_alt_name(ndns, devname);
/* 'uuid' must be set prior to setting size! */
ndctl_namespace_set_uuid(ndns, paramaters->uuid);
ndctl_namespace_set_size(ndns, paramaters->size);
/* unlike pmem namespaces, blk namespaces have a sector size */
if (parameters->lbasize)
ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
ndctl_namespace_enable(ndns);
}
Why the Term "namespace"?
^^^^^^^^^^^^^^^^^^^^^^^^^
1. Why not "volume" for instance? "volume" ran the risk of confusing
ND (libnvdimm subsystem) to a volume manager like device-mapper.
2. The term originated to describe the sub-devices that can be created
within a NVME controller (see the nvme specification:
http://www.nvmexpress.org/specifications/), and NFIT namespaces are
meant to parallel the capabilities and configurability of
NVME-namespaces.
LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
-------------------------------------------------
A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
block device driver that fronts either the whole block device or a
partition of a block device emitted by either a PMEM or BLK NAMESPACE.
LIBNVDIMM: btt layout
^^^^^^^^^^^^^^^^^^^^^
Every region will start out with at least one BTT device which is the
seed device. To activate it set the "namespace", "uuid", and
"sector_size" attributes and then bind the device to the nd_pmem or
nd_blk driver depending on the region type::
/sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
|-- namespace
|-- delete
|-- devtype
|-- modalias
|-- numa_node
|-- sector_size
|-- subsystem -> ../../../../../bus/nd
|-- uevent
`-- uuid
LIBNDCTL: btt creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Similar to namespaces an idle BTT device is automatically created per
region. Each time this "seed" btt device is configured and enabled a new
seed is created. Creating a BTT configuration involves two steps of
finding and idle BTT and assigning it to consume a PMEM or BLK namespace::
static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
{
struct ndctl_btt *btt;
ndctl_btt_foreach(region, btt)
if (!ndctl_btt_is_enabled(btt)
&& !ndctl_btt_is_configured(btt))
return btt;
return NULL;
}
static int configure_btt(struct ndctl_region *region,
struct btt_parameters *parameters)
{
btt = get_idle_btt(region);
ndctl_btt_set_uuid(btt, parameters->uuid);
ndctl_btt_set_sector_size(btt, parameters->sector_size);
ndctl_btt_set_namespace(btt, parameters->ndns);
/* turn off raw mode device */
ndctl_namespace_disable(parameters->ndns);
/* turn on btt access */
ndctl_btt_enable(btt);
}
Once instantiated a new inactive btt seed device will appear underneath
the region.
Once a "namespace" is removed from a BTT that instance of the BTT device
will be deleted or otherwise reset to default values. This deletion is
only at the device model level. In order to destroy a BTT the "info
block" needs to be destroyed. Note, that to destroy a BTT the media
needs to be written in raw mode. By default, the kernel will autodetect
the presence of a BTT and disable raw mode. This autodetect behavior
can be suppressed by enabling raw mode for the namespace via the
ndctl_namespace_set_raw_mode() API.
Summary LIBNDCTL Diagram
------------------------
For the given example above, here is the view of the objects as seen by the
LIBNDCTL API::
+---+
|CTX| +---------+ +--------------+ +---------------+
+-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
| | +---------+ +--------------+ +---------------+
+-------+ | | +---------+ +--------------+ +---------------+
| DIMM0 <-+ | +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
+-------+ | | | +---------+ +--------------+ +---------------+
| DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+
+-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0" |
| DIMM2 <-+ +----+ | +---------+ | +--------------+ +----------------------+
+-------+ | | +-> NAMESPACE2.1 +--> ND5 "blk2.1" | BTT2 |
| DIMM3 <-+ | +--------------+ +----------------------+
+-------+ | +---------+ +--------------+ +---------------+
+-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0" |
| +---------+ | +--------------+ +----------------------+
| +-> NAMESPACE3.1 +--> ND3 "blk3.1" | BTT1 |
| +--------------+ +----------------------+
| +---------+ +--------------+ +---------------+
+-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0" |
| +---------+ +--------------+ +---------------+
| +---------+ +--------------+ +----------------------+
+-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0" | BTT0 |
+---------+ +--------------+ +---------------+------+

View File

@@ -0,0 +1,143 @@
===============
NVDIMM Security
===============
1. Introduction
---------------
With the introduction of Intel Device Specific Methods (DSM) v1.8
specification [1], security DSMs are introduced. The spec added the following
security DSMs: "get security state", "set passphrase", "disable passphrase",
"unlock unit", "freeze lock", "secure erase", and "overwrite". A security_ops
data structure has been added to struct dimm in order to support the security
operations and generic APIs are exposed to allow vendor neutral operations.
2. Sysfs Interface
------------------
The "security" sysfs attribute is provided in the nvdimm sysfs directory. For
example:
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
The "show" attribute of that attribute will display the security state for
that DIMM. The following states are available: disabled, unlocked, locked,
frozen, and overwrite. If security is not supported, the sysfs attribute
will not be visible.
The "store" attribute takes several commands when it is being written to
in order to support some of the security functionalities:
update <old_keyid> <new_keyid> - enable or update passphrase.
disable <keyid> - disable enabled security and remove key.
freeze - freeze changing of security states.
erase <keyid> - delete existing user encryption key.
overwrite <keyid> - wipe the entire nvdimm.
master_update <keyid> <new_keyid> - enable or update master passphrase.
master_erase <keyid> - delete existing user encryption key.
3. Key Management
-----------------
The key is associated to the payload by the DIMM id. For example:
# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/nfit/id
8089-a2-1740-00000133
The DIMM id would be provided along with the key payload (passphrase) to
the kernel.
The security keys are managed on the basis of a single key per DIMM. The
key "passphrase" is expected to be 32bytes long. This is similar to the ATA
security specification [2]. A key is initially acquired via the request_key()
kernel API call during nvdimm unlock. It is up to the user to make sure that
all the keys are in the kernel user keyring for unlock.
A nvdimm encrypted-key of format enc32 has the description format of:
nvdimm:<bus-provider-specific-unique-id>
See file ``Documentation/security/keys/trusted-encrypted.rst`` for creating
encrypted-keys of enc32 format. TPM usage with a master trusted key is
preferred for sealing the encrypted-keys.
4. Unlocking
------------
When the DIMMs are being enumerated by the kernel, the kernel will attempt to
retrieve the key from the kernel user keyring. This is the only time
a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked
until reboot. Typically an entity (i.e. shell script) will inject all the
relevant encrypted-keys into the kernel user keyring during the initramfs phase.
This provides the unlock function access to all the related keys that contain
the passphrase for the respective nvdimms. It is also recommended that the
keys are injected before libnvdimm is loaded by modprobe.
5. Update
---------
When doing an update, it is expected that the existing key is removed from
the kernel user keyring and reinjected as different (old) key. It's irrelevant
what the key description is for the old key since we are only interested in the
keyid when doing the update operation. It is also expected that the new key
is injected with the description format described from earlier in this
document. The update command written to the sysfs attribute will be with
the format:
update <old keyid> <new keyid>
If there is no old keyid due to a security enabling, then a 0 should be
passed in.
6. Freeze
---------
The freeze operation does not require any keys. The security config can be
frozen by a user with root privelege.
7. Disable
----------
The security disable command format is:
disable <keyid>
An key with the current passphrase payload that is tied to the nvdimm should be
in the kernel user keyring.
8. Secure Erase
---------------
The command format for doing a secure erase is:
erase <keyid>
An key with the current passphrase payload that is tied to the nvdimm should be
in the kernel user keyring.
9. Overwrite
------------
The command format for doing an overwrite is:
overwrite <keyid>
Overwrite can be done without a key if security is not enabled. A key serial
of 0 can be passed in to indicate no key.
The sysfs attribute "security" can be polled to wait on overwrite completion.
Overwrite can last tens of minutes or more depending on nvdimm size.
An encrypted-key with the current user passphrase that is tied to the nvdimm
should be injected and its keyid should be passed in via sysfs.
10. Master Update
-----------------
The command format for doing a master update is:
update <old keyid> <new keyid>
The operating mechanism for master update is identical to update except the
master passphrase key is passed to the kernel. The master passphrase key
is just another encrypted-key.
This command is only available when security is disabled.
11. Master Erase
----------------
The command format for doing a master erase is:
master_erase <current keyid>
This command has the same operating mechanism as erase except the master
passphrase key is passed to the kernel. The master passphrase key is just
another encrypted-key.
This command is only available when the master security is enabled, indicated
by the extended security status.
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf

View File

@@ -0,0 +1,189 @@
.. SPDX-License-Identifier: GPL-2.0
===============
NVMEM Subsystem
===============
Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
This document explains the NVMEM Framework along with the APIs provided,
and how to use it.
1. Introduction
===============
*NVMEM* is the abbreviation for Non Volatile Memory layer. It is used to
retrieve configuration of SOC or Device specific data from non volatile
memories like eeprom, efuses and so on.
Before this framework existed, NVMEM drivers like eeprom were stored in
drivers/misc, where they all had to duplicate pretty much the same code to
register a sysfs file, allow in-kernel users to access the content of the
devices they were driving, etc.
This was also a problem as far as other in-kernel users were involved, since
the solutions used were pretty much different from one driver to another, there
was a rather big abstraction leak.
This framework aims at solve these problems. It also introduces DT
representation for consumer devices to go get the data they require (MAC
Addresses, SoC/Revision ID, part numbers, and so on) from the NVMEMs. This
framework is based on regmap, so that most of the abstraction available in
regmap can be reused, across multiple types of buses.
NVMEM Providers
+++++++++++++++
NVMEM provider refers to an entity that implements methods to initialize, read
and write the non-volatile memory.
2. Registering/Unregistering the NVMEM provider
===============================================
A NVMEM provider can register with NVMEM core by supplying relevant
nvmem configuration to nvmem_register(), on success core would return a valid
nvmem_device pointer.
nvmem_unregister(nvmem) is used to unregister a previously registered provider.
For example, a simple qfprom case::
static struct nvmem_config econfig = {
.name = "qfprom",
.owner = THIS_MODULE,
};
static int qfprom_probe(struct platform_device *pdev)
{
...
econfig.dev = &pdev->dev;
nvmem = nvmem_register(&econfig);
...
}
It is mandatory that the NVMEM provider has a regmap associated with its
struct device. Failure to do would return error code from nvmem_register().
Users of board files can define and register nvmem cells using the
nvmem_cell_table struct::
static struct nvmem_cell_info foo_nvmem_cells[] = {
{
.name = "macaddr",
.offset = 0x7f00,
.bytes = ETH_ALEN,
}
};
static struct nvmem_cell_table foo_nvmem_cell_table = {
.nvmem_name = "i2c-eeprom",
.cells = foo_nvmem_cells,
.ncells = ARRAY_SIZE(foo_nvmem_cells),
};
nvmem_add_cell_table(&foo_nvmem_cell_table);
Additionally it is possible to create nvmem cell lookup entries and register
them with the nvmem framework from machine code as shown in the example below::
static struct nvmem_cell_lookup foo_nvmem_lookup = {
.nvmem_name = "i2c-eeprom",
.cell_name = "macaddr",
.dev_id = "foo_mac.0",
.con_id = "mac-address",
};
nvmem_add_cell_lookups(&foo_nvmem_lookup, 1);
NVMEM Consumers
+++++++++++++++
NVMEM consumers are the entities which make use of the NVMEM provider to
read from and to NVMEM.
3. NVMEM cell based consumer APIs
=================================
NVMEM cells are the data entries/fields in the NVMEM.
The NVMEM framework provides 3 APIs to read/write NVMEM cells::
struct nvmem_cell *nvmem_cell_get(struct device *dev, const char *name);
struct nvmem_cell *devm_nvmem_cell_get(struct device *dev, const char *name);
void nvmem_cell_put(struct nvmem_cell *cell);
void devm_nvmem_cell_put(struct device *dev, struct nvmem_cell *cell);
void *nvmem_cell_read(struct nvmem_cell *cell, ssize_t *len);
int nvmem_cell_write(struct nvmem_cell *cell, void *buf, ssize_t len);
`*nvmem_cell_get()` apis will get a reference to nvmem cell for a given id,
and nvmem_cell_read/write() can then read or write to the cell.
Once the usage of the cell is finished the consumer should call
`*nvmem_cell_put()` to free all the allocation memory for the cell.
4. Direct NVMEM device based consumer APIs
==========================================
In some instances it is necessary to directly read/write the NVMEM.
To facilitate such consumers NVMEM framework provides below apis::
struct nvmem_device *nvmem_device_get(struct device *dev, const char *name);
struct nvmem_device *devm_nvmem_device_get(struct device *dev,
const char *name);
void nvmem_device_put(struct nvmem_device *nvmem);
int nvmem_device_read(struct nvmem_device *nvmem, unsigned int offset,
size_t bytes, void *buf);
int nvmem_device_write(struct nvmem_device *nvmem, unsigned int offset,
size_t bytes, void *buf);
int nvmem_device_cell_read(struct nvmem_device *nvmem,
struct nvmem_cell_info *info, void *buf);
int nvmem_device_cell_write(struct nvmem_device *nvmem,
struct nvmem_cell_info *info, void *buf);
Before the consumers can read/write NVMEM directly, it should get hold
of nvmem_controller from one of the `*nvmem_device_get()` api.
The difference between these apis and cell based apis is that these apis always
take nvmem_device as parameter.
5. Releasing a reference to the NVMEM
=====================================
When a consumer no longer needs the NVMEM, it has to release the reference
to the NVMEM it has obtained using the APIs mentioned in the above section.
The NVMEM framework provides 2 APIs to release a reference to the NVMEM::
void nvmem_cell_put(struct nvmem_cell *cell);
void devm_nvmem_cell_put(struct device *dev, struct nvmem_cell *cell);
void nvmem_device_put(struct nvmem_device *nvmem);
void devm_nvmem_device_put(struct device *dev, struct nvmem_device *nvmem);
Both these APIs are used to release a reference to the NVMEM and
devm_nvmem_cell_put and devm_nvmem_device_put destroys the devres associated
with this NVMEM.
Userspace
+++++++++
6. Userspace binary interface
==============================
Userspace can read/write the raw NVMEM file located at::
/sys/bus/nvmem/devices/*/nvmem
ex::
hexdump /sys/bus/nvmem/devices/qfprom0/nvmem
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00000a0 db10 2240 0000 e000 0c00 0c00 0000 0c00
0000000 0000 0000 0000 0000 0000 0000 0000 0000
...
*
0001000
7. DeviceTree Binding
=====================
See Documentation/devicetree/bindings/nvmem/nvmem.txt

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,18 @@
.. SPDX-License-Identifier: GPL-2.0
=====================
Generic PHY Framework
=====================
.. toctree::
phy
samsung-usb2
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

View File

@@ -0,0 +1,197 @@
=============
PHY subsystem
=============
:Author: Kishon Vijay Abraham I <kishon@ti.com>
This document explains the Generic PHY Framework along with the APIs provided,
and how-to-use.
Introduction
============
*PHY* is the abbreviation for physical layer. It is used to connect a device
to the physical medium e.g., the USB controller has a PHY to provide functions
such as serialization, de-serialization, encoding, decoding and is responsible
for obtaining the required data transmission rate. Note that some USB
controllers have PHY functionality embedded into it and others use an external
PHY. Other peripherals that use PHY include Wireless LAN, Ethernet,
SATA etc.
The intention of creating this framework is to bring the PHY drivers spread
all over the Linux kernel to drivers/phy to increase code re-use and for
better code maintainability.
This framework will be of use only to devices that use external PHY (PHY
functionality is not embedded within the controller).
Registering/Unregistering the PHY provider
==========================================
PHY provider refers to an entity that implements one or more PHY instances.
For the simple case where the PHY provider implements only a single instance of
the PHY, the framework provides its own implementation of of_xlate in
of_phy_simple_xlate. If the PHY provider implements multiple instances, it
should provide its own implementation of of_xlate. of_xlate is used only for
dt boot case.
::
#define of_phy_provider_register(dev, xlate) \
__of_phy_provider_register((dev), NULL, THIS_MODULE, (xlate))
#define devm_of_phy_provider_register(dev, xlate) \
__devm_of_phy_provider_register((dev), NULL, THIS_MODULE,
(xlate))
of_phy_provider_register and devm_of_phy_provider_register macros can be used to
register the phy_provider and it takes device and of_xlate as
arguments. For the dt boot case, all PHY providers should use one of the above
2 macros to register the PHY provider.
Often the device tree nodes associated with a PHY provider will contain a set
of children that each represent a single PHY. Some bindings may nest the child
nodes within extra levels for context and extensibility, in which case the low
level of_phy_provider_register_full() and devm_of_phy_provider_register_full()
macros can be used to override the node containing the children.
::
#define of_phy_provider_register_full(dev, children, xlate) \
__of_phy_provider_register(dev, children, THIS_MODULE, xlate)
#define devm_of_phy_provider_register_full(dev, children, xlate) \
__devm_of_phy_provider_register_full(dev, children,
THIS_MODULE, xlate)
void devm_of_phy_provider_unregister(struct device *dev,
struct phy_provider *phy_provider);
void of_phy_provider_unregister(struct phy_provider *phy_provider);
devm_of_phy_provider_unregister and of_phy_provider_unregister can be used to
unregister the PHY.
Creating the PHY
================
The PHY driver should create the PHY in order for other peripheral controllers
to make use of it. The PHY framework provides 2 APIs to create the PHY.
::
struct phy *phy_create(struct device *dev, struct device_node *node,
const struct phy_ops *ops);
struct phy *devm_phy_create(struct device *dev,
struct device_node *node,
const struct phy_ops *ops);
The PHY drivers can use one of the above 2 APIs to create the PHY by passing
the device pointer and phy ops.
phy_ops is a set of function pointers for performing PHY operations such as
init, exit, power_on and power_off.
Inorder to dereference the private data (in phy_ops), the phy provider driver
can use phy_set_drvdata() after creating the PHY and use phy_get_drvdata() in
phy_ops to get back the private data.
4. Getting a reference to the PHY
Before the controller can make use of the PHY, it has to get a reference to
it. This framework provides the following APIs to get a reference to the PHY.
::
struct phy *phy_get(struct device *dev, const char *string);
struct phy *phy_optional_get(struct device *dev, const char *string);
struct phy *devm_phy_get(struct device *dev, const char *string);
struct phy *devm_phy_optional_get(struct device *dev,
const char *string);
struct phy *devm_of_phy_get_by_index(struct device *dev,
struct device_node *np,
int index);
phy_get, phy_optional_get, devm_phy_get and devm_phy_optional_get can
be used to get the PHY. In the case of dt boot, the string arguments
should contain the phy name as given in the dt data and in the case of
non-dt boot, it should contain the label of the PHY. The two
devm_phy_get associates the device with the PHY using devres on
successful PHY get. On driver detach, release function is invoked on
the devres data and devres data is freed. phy_optional_get and
devm_phy_optional_get should be used when the phy is optional. These
two functions will never return -ENODEV, but instead returns NULL when
the phy cannot be found.Some generic drivers, such as ehci, may use multiple
phys and for such drivers referencing phy(s) by name(s) does not make sense. In
this case, devm_of_phy_get_by_index can be used to get a phy reference based on
the index.
It should be noted that NULL is a valid phy reference. All phy
consumer calls on the NULL phy become NOPs. That is the release calls,
the phy_init() and phy_exit() calls, and phy_power_on() and
phy_power_off() calls are all NOP when applied to a NULL phy. The NULL
phy is useful in devices for handling optional phy devices.
Releasing a reference to the PHY
================================
When the controller no longer needs the PHY, it has to release the reference
to the PHY it has obtained using the APIs mentioned in the above section. The
PHY framework provides 2 APIs to release a reference to the PHY.
::
void phy_put(struct phy *phy);
void devm_phy_put(struct device *dev, struct phy *phy);
Both these APIs are used to release a reference to the PHY and devm_phy_put
destroys the devres associated with this PHY.
Destroying the PHY
==================
When the driver that created the PHY is unloaded, it should destroy the PHY it
created using one of the following 2 APIs::
void phy_destroy(struct phy *phy);
void devm_phy_destroy(struct device *dev, struct phy *phy);
Both these APIs destroy the PHY and devm_phy_destroy destroys the devres
associated with this PHY.
PM Runtime
==========
This subsystem is pm runtime enabled. So while creating the PHY,
pm_runtime_enable of the phy device created by this subsystem is called and
while destroying the PHY, pm_runtime_disable is called. Note that the phy
device created by this subsystem will be a child of the device that calls
phy_create (PHY provider device).
So pm_runtime_get_sync of the phy_device created by this subsystem will invoke
pm_runtime_get_sync of PHY provider device because of parent-child relationship.
It should also be noted that phy_power_on and phy_power_off performs
phy_pm_runtime_get_sync and phy_pm_runtime_put respectively.
There are exported APIs like phy_pm_runtime_get, phy_pm_runtime_get_sync,
phy_pm_runtime_put, phy_pm_runtime_put_sync, phy_pm_runtime_allow and
phy_pm_runtime_forbid for performing PM operations.
PHY Mappings
============
In order to get reference to a PHY without help from DeviceTree, the framework
offers lookups which can be compared to clkdev that allow clk structures to be
bound to devices. A lookup can be made be made during runtime when a handle to
the struct phy already exists.
The framework offers the following API for registering and unregistering the
lookups::
int phy_create_lookup(struct phy *phy, const char *con_id,
const char *dev_id);
void phy_remove_lookup(struct phy *phy, const char *con_id,
const char *dev_id);
DeviceTree Binding
==================
The documentation for PHY dt binding can be found @
Documentation/devicetree/bindings/phy/phy-bindings.txt

View File

@@ -0,0 +1,137 @@
====================================
Samsung USB 2.0 PHY adaptation layer
====================================
1. Description
--------------
The architecture of the USB 2.0 PHY module in Samsung SoCs is similar
among many SoCs. In spite of the similarities it proved difficult to
create a one driver that would fit all these PHY controllers. Often
the differences were minor and were found in particular bits of the
registers of the PHY. In some rare cases the order of register writes or
the PHY powering up process had to be altered. This adaptation layer is
a compromise between having separate drivers and having a single driver
with added support for many special cases.
2. Files description
--------------------
- phy-samsung-usb2.c
This is the main file of the adaptation layer. This file contains
the probe function and provides two callbacks to the Generic PHY
Framework. This two callbacks are used to power on and power off the
phy. They carry out the common work that has to be done on all version
of the PHY module. Depending on which SoC was chosen they execute SoC
specific callbacks. The specific SoC version is selected by choosing
the appropriate compatible string. In addition, this file contains
struct of_device_id definitions for particular SoCs.
- phy-samsung-usb2.h
This is the include file. It declares the structures used by this
driver. In addition it should contain extern declarations for
structures that describe particular SoCs.
3. Supporting SoCs
------------------
To support a new SoC a new file should be added to the drivers/phy
directory. Each SoC's configuration is stored in an instance of the
struct samsung_usb2_phy_config::
struct samsung_usb2_phy_config {
const struct samsung_usb2_common_phy *phys;
int (*rate_to_clk)(unsigned long, u32 *);
unsigned int num_phys;
bool has_mode_switch;
};
The num_phys is the number of phys handled by the driver. `*phys` is an
array that contains the configuration for each phy. The has_mode_switch
property is a boolean flag that determines whether the SoC has USB host
and device on a single pair of pins. If so, a special register has to
be modified to change the internal routing of these pins between a USB
device or host module.
For example the configuration for Exynos 4210 is following::
const struct samsung_usb2_phy_config exynos4210_usb2_phy_config = {
.has_mode_switch = 0,
.num_phys = EXYNOS4210_NUM_PHYS,
.phys = exynos4210_phys,
.rate_to_clk = exynos4210_rate_to_clk,
}
- `int (*rate_to_clk)(unsigned long, u32 *)`
The rate_to_clk callback is to convert the rate of the clock
used as the reference clock for the PHY module to the value
that should be written in the hardware register.
The exynos4210_phys configuration array is as follows::
static const struct samsung_usb2_common_phy exynos4210_phys[] = {
{
.label = "device",
.id = EXYNOS4210_DEVICE,
.power_on = exynos4210_power_on,
.power_off = exynos4210_power_off,
},
{
.label = "host",
.id = EXYNOS4210_HOST,
.power_on = exynos4210_power_on,
.power_off = exynos4210_power_off,
},
{
.label = "hsic0",
.id = EXYNOS4210_HSIC0,
.power_on = exynos4210_power_on,
.power_off = exynos4210_power_off,
},
{
.label = "hsic1",
.id = EXYNOS4210_HSIC1,
.power_on = exynos4210_power_on,
.power_off = exynos4210_power_off,
},
{},
};
- `int (*power_on)(struct samsung_usb2_phy_instance *);`
`int (*power_off)(struct samsung_usb2_phy_instance *);`
These two callbacks are used to power on and power off the phy
by modifying appropriate registers.
Final change to the driver is adding appropriate compatible value to the
phy-samsung-usb2.c file. In case of Exynos 4210 the following lines were
added to the struct of_device_id samsung_usb2_phy_of_match[] array::
#ifdef CONFIG_PHY_EXYNOS4210_USB2
{
.compatible = "samsung,exynos4210-usb2-phy",
.data = &exynos4210_usb2_phy_config,
},
#endif
To add further flexibility to the driver the Kconfig file enables to
include support for selected SoCs in the compiled driver. The Kconfig
entry for Exynos 4210 is following::
config PHY_EXYNOS4210_USB2
bool "Support for Exynos 4210"
depends on PHY_SAMSUNG_USB2
depends on CPU_EXYNOS4210
help
Enable USB PHY support for Exynos 4210. This option requires that
Samsung USB 2.0 PHY driver is enabled and means that support for this
particular SoC is compiled in the driver. In case of Exynos 4210 four
phys are available - device, host, HSCI0 and HSCI1.
The newly created file that supports the new SoC has to be also added to the
Makefile. In case of Exynos 4210 the added line is following::
obj-$(CONFIG_PHY_EXYNOS4210_USB2) += phy-exynos4210-usb2.o
After completing these steps the support for the new SoC should be ready.

View File

@@ -1,4 +1,4 @@
:orphan:
.. SPDX-License-Identifier: GPL-2.0
======================
PPS - Pulse Per Second

View File

@@ -0,0 +1,106 @@
.. SPDX-License-Identifier: GPL-2.0
=============
Intel MID PTI
=============
The Intel MID PTI project is HW implemented in Intel Atom
system-on-a-chip designs based on the Parallel Trace
Interface for MIPI P1149.7 cJTAG standard. The kernel solution
for this platform involves the following files::
./include/linux/pti.h
./drivers/.../n_tracesink.h
./drivers/.../n_tracerouter.c
./drivers/.../n_tracesink.c
./drivers/.../pti.c
pti.c is the driver that enables various debugging features
popular on platforms from certain mobile manufacturers.
n_tracerouter.c and n_tracesink.c allow extra system information to
be collected and routed to the pti driver, such as trace
debugging data from a modem. Although n_tracerouter
and n_tracesink are a part of the complete PTI solution,
these two line disciplines can work separately from
pti.c and route any data stream from one /dev/tty node
to another /dev/tty node via kernel-space. This provides
a stable, reliable connection that will not break unless
the user-space application shuts down (plus avoids
kernel->user->kernel context switch overheads of routing
data).
An example debugging usage for this driver system:
* Hook /dev/ttyPTI0 to syslogd. Opening this port will also start
a console device to further capture debugging messages to PTI.
* Hook /dev/ttyPTI1 to modem debugging data to write to PTI HW.
This is where n_tracerouter and n_tracesink are used.
* Hook /dev/pti to a user-level debugging application for writing
to PTI HW.
* `Use mipi_` Kernel Driver API in other device drivers for
debugging to PTI by first requesting a PTI write address via
mipi_request_masterchannel(1).
Below is example pseudo-code on how a 'privileged' application
can hook up n_tracerouter and n_tracesink to any tty on
a system. 'Privileged' means the application has enough
privileges to successfully manipulate the ldisc drivers
but is not just blindly executing as 'root'. Keep in mind
the use of ioctl(,TIOCSETD,) is not specific to the n_tracerouter
and n_tracesink line discpline drivers but is a generic
operation for a program to use a line discpline driver
on a tty port other than the default n_tty::
/////////// To hook up n_tracerouter and n_tracesink /////////
// Note that n_tracerouter depends on n_tracesink.
#include <errno.h>
#define ONE_TTY "/dev/ttyOne"
#define TWO_TTY "/dev/ttyTwo"
// needed global to hand onto ldisc connection
static int g_fd_source = -1;
static int g_fd_sink = -1;
// these two vars used to grab LDISC values from loaded ldisc drivers
// in OS. Look at /proc/tty/ldiscs to get the right numbers from
// the ldiscs loaded in the system.
int source_ldisc_num, sink_ldisc_num = -1;
int retval;
g_fd_source = open(ONE_TTY, O_RDWR); // must be R/W
g_fd_sink = open(TWO_TTY, O_RDWR); // must be R/W
if (g_fd_source <= 0) || (g_fd_sink <= 0) {
// doubt you'll want to use these exact error lines of code
printf("Error on open(). errno: %d\n",errno);
return errno;
}
retval = ioctl(g_fd_sink, TIOCSETD, &sink_ldisc_num);
if (retval < 0) {
printf("Error on ioctl(). errno: %d\n", errno);
return errno;
}
retval = ioctl(g_fd_source, TIOCSETD, &source_ldisc_num);
if (retval < 0) {
printf("Error on ioctl(). errno: %d\n", errno);
return errno;
}
/////////// To disconnect n_tracerouter and n_tracesink ////////
// First make sure data through the ldiscs has stopped.
// Second, disconnect ldiscs. This provides a
// little cleaner shutdown on tty stack.
sink_ldisc_num = 0;
source_ldisc_num = 0;
ioctl(g_fd_uart, TIOCSETD, &sink_ldisc_num);
ioctl(g_fd_gadget, TIOCSETD, &source_ldisc_num);
// Three, program closes connection, and cleanup:
close(g_fd_uart);
close(g_fd_gadget);
g_fd_uart = g_fd_gadget = NULL;

View File

@@ -1,4 +1,4 @@
:orphan:
.. SPDX-License-Identifier: GPL-2.0
===========================================
PTP hardware clock infrastructure for Linux

View File

@@ -0,0 +1,165 @@
======================================
Pulse Width Modulation (PWM) interface
======================================
This provides an overview about the Linux PWM interface
PWMs are commonly used for controlling LEDs, fans or vibrators in
cell phones. PWMs with a fixed purpose have no need implementing
the Linux PWM API (although they could). However, PWMs are often
found as discrete devices on SoCs which have no fixed purpose. It's
up to the board designer to connect them to LEDs or fans. To provide
this kind of flexibility the generic PWM API exists.
Identifying PWMs
----------------
Users of the legacy PWM API use unique IDs to refer to PWM devices.
Instead of referring to a PWM device via its unique ID, board setup code
should instead register a static mapping that can be used to match PWM
consumers to providers, as given in the following example::
static struct pwm_lookup board_pwm_lookup[] = {
PWM_LOOKUP("tegra-pwm", 0, "pwm-backlight", NULL,
50000, PWM_POLARITY_NORMAL),
};
static void __init board_init(void)
{
...
pwm_add_table(board_pwm_lookup, ARRAY_SIZE(board_pwm_lookup));
...
}
Using PWMs
----------
Legacy users can request a PWM device using pwm_request() and free it
after usage with pwm_free().
New users should use the pwm_get() function and pass to it the consumer
device or a consumer name. pwm_put() is used to free the PWM device. Managed
variants of these functions, devm_pwm_get() and devm_pwm_put(), also exist.
After being requested, a PWM has to be configured using::
int pwm_apply_state(struct pwm_device *pwm, struct pwm_state *state);
This API controls both the PWM period/duty_cycle config and the
enable/disable state.
The pwm_config(), pwm_enable() and pwm_disable() functions are just wrappers
around pwm_apply_state() and should not be used if the user wants to change
several parameter at once. For example, if you see pwm_config() and
pwm_{enable,disable}() calls in the same function, this probably means you
should switch to pwm_apply_state().
The PWM user API also allows one to query the PWM state with pwm_get_state().
In addition to the PWM state, the PWM API also exposes PWM arguments, which
are the reference PWM config one should use on this PWM.
PWM arguments are usually platform-specific and allows the PWM user to only
care about dutycycle relatively to the full period (like, duty = 50% of the
period). struct pwm_args contains 2 fields (period and polarity) and should
be used to set the initial PWM config (usually done in the probe function
of the PWM user). PWM arguments are retrieved with pwm_get_args().
All consumers should really be reconfiguring the PWM upon resume as
appropriate. This is the only way to ensure that everything is resumed in
the proper order.
Using PWMs with the sysfs interface
-----------------------------------
If CONFIG_SYSFS is enabled in your kernel configuration a simple sysfs
interface is provided to use the PWMs from userspace. It is exposed at
/sys/class/pwm/. Each probed PWM controller/chip will be exported as
pwmchipN, where N is the base of the PWM chip. Inside the directory you
will find:
npwm
The number of PWM channels this chip supports (read-only).
export
Exports a PWM channel for use with sysfs (write-only).
unexport
Unexports a PWM channel from sysfs (write-only).
The PWM channels are numbered using a per-chip index from 0 to npwm-1.
When a PWM channel is exported a pwmX directory will be created in the
pwmchipN directory it is associated with, where X is the number of the
channel that was exported. The following properties will then be available:
period
The total period of the PWM signal (read/write).
Value is in nanoseconds and is the sum of the active and inactive
time of the PWM.
duty_cycle
The active time of the PWM signal (read/write).
Value is in nanoseconds and must be less than the period.
polarity
Changes the polarity of the PWM signal (read/write).
Writes to this property only work if the PWM chip supports changing
the polarity. The polarity can only be changed if the PWM is not
enabled. Value is the string "normal" or "inversed".
enable
Enable/disable the PWM signal (read/write).
- 0 - disabled
- 1 - enabled
Implementing a PWM driver
-------------------------
Currently there are two ways to implement pwm drivers. Traditionally
there only has been the barebone API meaning that each driver has
to implement the pwm_*() functions itself. This means that it's impossible
to have multiple PWM drivers in the system. For this reason it's mandatory
for new drivers to use the generic PWM framework.
A new PWM controller/chip can be added using pwmchip_add() and removed
again with pwmchip_remove(). pwmchip_add() takes a filled in struct
pwm_chip as argument which provides a description of the PWM chip, the
number of PWM devices provided by the chip and the chip-specific
implementation of the supported PWM operations to the framework.
When implementing polarity support in a PWM driver, make sure to respect the
signal conventions in the PWM framework. By definition, normal polarity
characterizes a signal starts high for the duration of the duty cycle and
goes low for the remainder of the period. Conversely, a signal with inversed
polarity starts low for the duration of the duty cycle and goes high for the
remainder of the period.
Drivers are encouraged to implement ->apply() instead of the legacy
->enable(), ->disable() and ->config() methods. Doing that should provide
atomicity in the PWM config workflow, which is required when the PWM controls
a critical device (like a regulator).
The implementation of ->get_state() (a method used to retrieve initial PWM
state) is also encouraged for the same reason: letting the PWM user know
about the current PWM state would allow him to avoid glitches.
Drivers should not implement any power management. In other words,
consumers should implement it as described in the "Using PWMs" section.
Locking
-------
The PWM core list manipulations are protected by a mutex, so pwm_request()
and pwm_free() may not be called from an atomic context. Currently the
PWM core does not enforce any locking to pwm_enable(), pwm_disable() and
pwm_config(), so the calling context is currently driver specific. This
is an issue derived from the former barebone API and should be fixed soon.
Helpers
-------
Currently a PWM can only be configured with period_ns and duty_ns. For several
use cases freq_hz and duty_percent might be better. Instead of calculating
this in your driver please consider adding appropriate helpers to the framework.

View File

@@ -1,107 +0,0 @@
=======================
RapidIO Subsystem Guide
=======================
:Author: Matt Porter
Introduction
============
RapidIO is a high speed switched fabric interconnect with features aimed
at the embedded market. RapidIO provides support for memory-mapped I/O
as well as message-based transactions over the switched fabric network.
RapidIO has a standardized discovery mechanism not unlike the PCI bus
standard that allows simple detection of devices in a network.
This documentation is provided for developers intending to support
RapidIO on new architectures, write new drivers, or to understand the
subsystem internals.
Known Bugs and Limitations
==========================
Bugs
----
None. ;)
Limitations
-----------
1. Access/management of RapidIO memory regions is not supported
2. Multiple host enumeration is not supported
RapidIO driver interface
========================
Drivers are provided a set of calls in order to interface with the
subsystem to gather info on devices, request/map memory region
resources, and manage mailboxes/doorbells.
Functions
---------
.. kernel-doc:: include/linux/rio_drv.h
:internal:
.. kernel-doc:: drivers/rapidio/rio-driver.c
:export:
.. kernel-doc:: drivers/rapidio/rio.c
:export:
Internals
=========
This chapter contains the autogenerated documentation of the RapidIO
subsystem.
Structures
----------
.. kernel-doc:: include/linux/rio.h
:internal:
Enumeration and Discovery
-------------------------
.. kernel-doc:: drivers/rapidio/rio-scan.c
:internal:
Driver functionality
--------------------
.. kernel-doc:: drivers/rapidio/rio.c
:internal:
.. kernel-doc:: drivers/rapidio/rio-access.c
:internal:
Device model support
--------------------
.. kernel-doc:: drivers/rapidio/rio-driver.c
:internal:
PPC32 support
-------------
.. kernel-doc:: arch/powerpc/sysdev/fsl_rio.c
:internal:
Credits
=======
The following people have contributed to the RapidIO subsystem directly
or indirectly:
1. Matt Porter\ mporter@kernel.crashing.org
2. Randy Vinson\ rvinson@mvista.com
3. Dan Malek\ dan@embeddedalley.com
The following people have contributed to this document:
1. Matt Porter\ mporter@kernel.crashing.org

View File

@@ -0,0 +1,15 @@
.. SPDX-License-Identifier: GPL-2.0
===========================
The Linux RapidIO Subsystem
===========================
.. toctree::
:maxdepth: 1
rapidio
sysfs
tsi721
mport_cdev
rio_cm

View File

@@ -0,0 +1,110 @@
==================================================================
RapidIO subsystem mport character device driver (rio_mport_cdev.c)
==================================================================
1. Overview
===========
This device driver is the result of collaboration within the RapidIO.org
Software Task Group (STG) between Texas Instruments, Freescale,
Prodrive Technologies, Nokia Networks, BAE and IDT. Additional input was
received from other members of RapidIO.org. The objective was to create a
character mode driver interface which exposes the capabilities of RapidIO
devices directly to applications, in a manner that allows the numerous and
varied RapidIO implementations to interoperate.
This driver (MPORT_CDEV) provides access to basic RapidIO subsystem operations
for user-space applications. Most of RapidIO operations are supported through
'ioctl' system calls.
When loaded this device driver creates filesystem nodes named rio_mportX in /dev
directory for each registered RapidIO mport device. 'X' in the node name matches
to unique port ID assigned to each local mport device.
Using available set of ioctl commands user-space applications can perform
following RapidIO bus and subsystem operations:
- Reads and writes from/to configuration registers of mport devices
(RIO_MPORT_MAINT_READ_LOCAL/RIO_MPORT_MAINT_WRITE_LOCAL)
- Reads and writes from/to configuration registers of remote RapidIO devices.
This operations are defined as RapidIO Maintenance reads/writes in RIO spec.
(RIO_MPORT_MAINT_READ_REMOTE/RIO_MPORT_MAINT_WRITE_REMOTE)
- Set RapidIO Destination ID for mport devices (RIO_MPORT_MAINT_HDID_SET)
- Set RapidIO Component Tag for mport devices (RIO_MPORT_MAINT_COMPTAG_SET)
- Query logical index of mport devices (RIO_MPORT_MAINT_PORT_IDX_GET)
- Query capabilities and RapidIO link configuration of mport devices
(RIO_MPORT_GET_PROPERTIES)
- Enable/Disable reporting of RapidIO doorbell events to user-space applications
(RIO_ENABLE_DOORBELL_RANGE/RIO_DISABLE_DOORBELL_RANGE)
- Enable/Disable reporting of RIO port-write events to user-space applications
(RIO_ENABLE_PORTWRITE_RANGE/RIO_DISABLE_PORTWRITE_RANGE)
- Query/Control type of events reported through this driver: doorbells,
port-writes or both (RIO_SET_EVENT_MASK/RIO_GET_EVENT_MASK)
- Configure/Map mport's outbound requests window(s) for specific size,
RapidIO destination ID, hopcount and request type
(RIO_MAP_OUTBOUND/RIO_UNMAP_OUTBOUND)
- Configure/Map mport's inbound requests window(s) for specific size,
RapidIO base address and local memory base address
(RIO_MAP_INBOUND/RIO_UNMAP_INBOUND)
- Allocate/Free contiguous DMA coherent memory buffer for DMA data transfers
to/from remote RapidIO devices (RIO_ALLOC_DMA/RIO_FREE_DMA)
- Initiate DMA data transfers to/from remote RapidIO devices (RIO_TRANSFER).
Supports blocking, asynchronous and posted (a.k.a 'fire-and-forget') data
transfer modes.
- Check/Wait for completion of asynchronous DMA data transfer
(RIO_WAIT_FOR_ASYNC)
- Manage device objects supported by RapidIO subsystem (RIO_DEV_ADD/RIO_DEV_DEL).
This allows implementation of various RapidIO fabric enumeration algorithms
as user-space applications while using remaining functionality provided by
kernel RapidIO subsystem.
2. Hardware Compatibility
=========================
This device driver uses standard interfaces defined by kernel RapidIO subsystem
and therefore it can be used with any mport device driver registered by RapidIO
subsystem with limitations set by available mport implementation.
At this moment the most common limitation is availability of RapidIO-specific
DMA engine framework for specific mport device. Users should verify available
functionality of their platform when planning to use this driver:
- IDT Tsi721 PCIe-to-RapidIO bridge device and its mport device driver are fully
compatible with this driver.
- Freescale SoCs 'fsl_rio' mport driver does not have implementation for RapidIO
specific DMA engine support and therefore DMA data transfers mport_cdev driver
are not available.
3. Module parameters
====================
- 'dma_timeout'
- DMA transfer completion timeout (in msec, default value 3000).
This parameter set a maximum completion wait time for SYNC mode DMA
transfer requests and for RIO_WAIT_FOR_ASYNC ioctl requests.
- 'dbg_level'
- This parameter allows to control amount of debug information
generated by this device driver. This parameter is formed by set of
bit masks that correspond to the specific functional blocks.
For mask definitions see 'drivers/rapidio/devices/rio_mport_cdev.c'
This parameter can be changed dynamically.
Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level.
4. Known problems
=================
None.
5. User-space Applications and API
==================================
API library and applications that use this device driver are available from
RapidIO.org.
6. TODO List
============
- Add support for sending/receiving "raw" RapidIO messaging packets.
- Add memory mapped DMA data transfers as an option when RapidIO-specific DMA
is not available.

View File

@@ -0,0 +1,362 @@
============
Introduction
============
The RapidIO standard is a packet-based fabric interconnect standard designed for
use in embedded systems. Development of the RapidIO standard is directed by the
RapidIO Trade Association (RTA). The current version of the RapidIO specification
is publicly available for download from the RTA web-site [1].
This document describes the basics of the Linux RapidIO subsystem and provides
information on its major components.
1 Overview
==========
Because the RapidIO subsystem follows the Linux device model it is integrated
into the kernel similarly to other buses by defining RapidIO-specific device and
bus types and registering them within the device model.
The Linux RapidIO subsystem is architecture independent and therefore defines
architecture-specific interfaces that provide support for common RapidIO
subsystem operations.
2. Core Components
==================
A typical RapidIO network is a combination of endpoints and switches.
Each of these components is represented in the subsystem by an associated data
structure. The core logical components of the RapidIO subsystem are defined
in include/linux/rio.h file.
2.1 Master Port
---------------
A master port (or mport) is a RapidIO interface controller that is local to the
processor executing the Linux code. A master port generates and receives RapidIO
packets (transactions). In the RapidIO subsystem each master port is represented
by a rio_mport data structure. This structure contains master port specific
resources such as mailboxes and doorbells. The rio_mport also includes a unique
host device ID that is valid when a master port is configured as an enumerating
host.
RapidIO master ports are serviced by subsystem specific mport device drivers
that provide functionality defined for this subsystem. To provide a hardware
independent interface for RapidIO subsystem operations, rio_mport structure
includes rio_ops data structure which contains pointers to hardware specific
implementations of RapidIO functions.
2.2 Device
----------
A RapidIO device is any endpoint (other than mport) or switch in the network.
All devices are presented in the RapidIO subsystem by corresponding rio_dev data
structure. Devices form one global device list and per-network device lists
(depending on number of available mports and networks).
2.3 Switch
----------
A RapidIO switch is a special class of device that routes packets between its
ports towards their final destination. The packet destination port within a
switch is defined by an internal routing table. A switch is presented in the
RapidIO subsystem by rio_dev data structure expanded by additional rio_switch
data structure, which contains switch specific information such as copy of the
routing table and pointers to switch specific functions.
The RapidIO subsystem defines the format and initialization method for subsystem
specific switch drivers that are designed to provide hardware-specific
implementation of common switch management routines.
2.4 Network
-----------
A RapidIO network is a combination of interconnected endpoint and switch devices.
Each RapidIO network known to the system is represented by corresponding rio_net
data structure. This structure includes lists of all devices and local master
ports that form the same network. It also contains a pointer to the default
master port that is used to communicate with devices within the network.
2.5 Device Drivers
------------------
RapidIO device-specific drivers follow Linux Kernel Driver Model and are
intended to support specific RapidIO devices attached to the RapidIO network.
2.6 Subsystem Interfaces
------------------------
RapidIO interconnect specification defines features that may be used to provide
one or more common service layers for all participating RapidIO devices. These
common services may act separately from device-specific drivers or be used by
device-specific drivers. Example of such service provider is the RIONET driver
which implements Ethernet-over-RapidIO interface. Because only one driver can be
registered for a device, all common RapidIO services have to be registered as
subsystem interfaces. This allows to have multiple common services attached to
the same device without blocking attachment of a device-specific driver.
3. Subsystem Initialization
===========================
In order to initialize the RapidIO subsystem, a platform must initialize and
register at least one master port within the RapidIO network. To register mport
within the subsystem controller driver's initialization code calls function
rio_register_mport() for each available master port.
After all active master ports are registered with a RapidIO subsystem,
an enumeration and/or discovery routine may be called automatically or
by user-space command.
RapidIO subsystem can be configured to be built as a statically linked or
modular component of the kernel (see details below).
4. Enumeration and Discovery
============================
4.1 Overview
------------
RapidIO subsystem configuration options allow users to build enumeration and
discovery methods as statically linked components or loadable modules.
An enumeration/discovery method implementation and available input parameters
define how any given method can be attached to available RapidIO mports:
simply to all available mports OR individually to the specified mport device.
Depending on selected enumeration/discovery build configuration, there are
several methods to initiate an enumeration and/or discovery process:
(a) Statically linked enumeration and discovery process can be started
automatically during kernel initialization time using corresponding module
parameters. This was the original method used since introduction of RapidIO
subsystem. Now this method relies on enumerator module parameter which is
'rio-scan.scan' for existing basic enumeration/discovery method.
When automatic start of enumeration/discovery is used a user has to ensure
that all discovering endpoints are started before the enumerating endpoint
and are waiting for enumeration to be completed.
Configuration option CONFIG_RAPIDIO_DISC_TIMEOUT defines time that discovering
endpoint waits for enumeration to be completed. If the specified timeout
expires the discovery process is terminated without obtaining RapidIO network
information. NOTE: a timed out discovery process may be restarted later using
a user-space command as it is described below (if the given endpoint was
enumerated successfully).
(b) Statically linked enumeration and discovery process can be started by
a command from user space. This initiation method provides more flexibility
for a system startup compared to the option (a) above. After all participating
endpoints have been successfully booted, an enumeration process shall be
started first by issuing a user-space command, after an enumeration is
completed a discovery process can be started on all remaining endpoints.
(c) Modular enumeration and discovery process can be started by a command from
user space. After an enumeration/discovery module is loaded, a network scan
process can be started by issuing a user-space command.
Similar to the option (b) above, an enumerator has to be started first.
(d) Modular enumeration and discovery process can be started by a module
initialization routine. In this case an enumerating module shall be loaded
first.
When a network scan process is started it calls an enumeration or discovery
routine depending on the configured role of a master port: host or agent.
Enumeration is performed by a master port if it is configured as a host port by
assigning a host destination ID greater than or equal to zero. The host
destination ID can be assigned to a master port using various methods depending
on RapidIO subsystem build configuration:
(a) For a statically linked RapidIO subsystem core use command line parameter
"rapidio.hdid=" with a list of destination ID assignments in order of mport
device registration. For example, in a system with two RapidIO controllers
the command line parameter "rapidio.hdid=-1,7" will result in assignment of
the host destination ID=7 to the second RapidIO controller, while the first
one will be assigned destination ID=-1.
(b) If the RapidIO subsystem core is built as a loadable module, in addition
to the method shown above, the host destination ID(s) can be specified using
traditional methods of passing module parameter "hdid=" during its loading:
- from command line: "modprobe rapidio hdid=-1,7", or
- from modprobe configuration file using configuration command "options",
like in this example: "options rapidio hdid=-1,7". An example of modprobe
configuration file is provided in the section below.
NOTES:
(i) if "hdid=" parameter is omitted all available mport will be assigned
destination ID = -1;
(ii) the "hdid=" parameter in systems with multiple mports can have
destination ID assignments omitted from the end of list (default = -1).
If the host device ID for a specific master port is set to -1, the discovery
process will be performed for it.
The enumeration and discovery routines use RapidIO maintenance transactions
to access the configuration space of devices.
NOTE: If RapidIO switch-specific device drivers are built as loadable modules
they must be loaded before enumeration/discovery process starts.
This requirement is cased by the fact that enumeration/discovery methods invoke
vendor-specific callbacks on early stages.
4.2 Automatic Start of Enumeration and Discovery
------------------------------------------------
Automatic enumeration/discovery start method is applicable only to built-in
enumeration/discovery RapidIO configuration selection. To enable automatic
enumeration/discovery start by existing basic enumerator method set use boot
command line parameter "rio-scan.scan=1".
This configuration requires synchronized start of all RapidIO endpoints that
form a network which will be enumerated/discovered. Discovering endpoints have
to be started before an enumeration starts to ensure that all RapidIO
controllers have been initialized and are ready to be discovered. Configuration
parameter CONFIG_RAPIDIO_DISC_TIMEOUT defines time (in seconds) which
a discovering endpoint will wait for enumeration to be completed.
When automatic enumeration/discovery start is selected, basic method's
initialization routine calls rio_init_mports() to perform enumeration or
discovery for all known mport devices.
Depending on RapidIO network size and configuration this automatic
enumeration/discovery start method may be difficult to use due to the
requirement for synchronized start of all endpoints.
4.3 User-space Start of Enumeration and Discovery
-------------------------------------------------
User-space start of enumeration and discovery can be used with built-in and
modular build configurations. For user-space controlled start RapidIO subsystem
creates the sysfs write-only attribute file '/sys/bus/rapidio/scan'. To initiate
an enumeration or discovery process on specific mport device, a user needs to
write mport_ID (not RapidIO destination ID) into that file. The mport_ID is a
sequential number (0 ... RIO_MAX_MPORTS) assigned during mport device
registration. For example for machine with single RapidIO controller, mport_ID
for that controller always will be 0.
To initiate RapidIO enumeration/discovery on all available mports a user may
write '-1' (or RIO_MPORT_ANY) into the scan attribute file.
4.4 Basic Enumeration Method
----------------------------
This is an original enumeration/discovery method which is available since
first release of RapidIO subsystem code. The enumeration process is
implemented according to the enumeration algorithm outlined in the RapidIO
Interconnect Specification: Annex I [1].
This method can be configured as statically linked or loadable module.
The method's single parameter "scan" allows to trigger the enumeration/discovery
process from module initialization routine.
This enumeration/discovery method can be started only once and does not support
unloading if it is built as a module.
The enumeration process traverses the network using a recursive depth-first
algorithm. When a new device is found, the enumerator takes ownership of that
device by writing into the Host Device ID Lock CSR. It does this to ensure that
the enumerator has exclusive right to enumerate the device. If device ownership
is successfully acquired, the enumerator allocates a new rio_dev structure and
initializes it according to device capabilities.
If the device is an endpoint, a unique device ID is assigned to it and its value
is written into the device's Base Device ID CSR.
If the device is a switch, the enumerator allocates an additional rio_switch
structure to store switch specific information. Then the switch's vendor ID and
device ID are queried against a table of known RapidIO switches. Each switch
table entry contains a pointer to a switch-specific initialization routine that
initializes pointers to the rest of switch specific operations, and performs
hardware initialization if necessary. A RapidIO switch does not have a unique
device ID; it relies on hopcount and routing for device ID of an attached
endpoint if access to its configuration registers is required. If a switch (or
chain of switches) does not have any endpoint (except enumerator) attached to
it, a fake device ID will be assigned to configure a route to that switch.
In the case of a chain of switches without endpoint, one fake device ID is used
to configure a route through the entire chain and switches are differentiated by
their hopcount value.
For both endpoints and switches the enumerator writes a unique component tag
into device's Component Tag CSR. That unique value is used by the error
management notification mechanism to identify a device that is reporting an
error management event.
Enumeration beyond a switch is completed by iterating over each active egress
port of that switch. For each active link, a route to a default device ID
(0xFF for 8-bit systems and 0xFFFF for 16-bit systems) is temporarily written
into the routing table. The algorithm recurs by calling itself with hopcount + 1
and the default device ID in order to access the device on the active port.
After the host has completed enumeration of the entire network it releases
devices by clearing device ID locks (calls rio_clear_locks()). For each endpoint
in the system, it sets the Discovered bit in the Port General Control CSR
to indicate that enumeration is completed and agents are allowed to execute
passive discovery of the network.
The discovery process is performed by agents and is similar to the enumeration
process that is described above. However, the discovery process is performed
without changes to the existing routing because agents only gather information
about RapidIO network structure and are building an internal map of discovered
devices. This way each Linux-based component of the RapidIO subsystem has
a complete view of the network. The discovery process can be performed
simultaneously by several agents. After initializing its RapidIO master port
each agent waits for enumeration completion by the host for the configured wait
time period. If this wait time period expires before enumeration is completed,
an agent skips RapidIO discovery and continues with remaining kernel
initialization.
4.5 Adding New Enumeration/Discovery Method
-------------------------------------------
RapidIO subsystem code organization allows addition of new enumeration/discovery
methods as new configuration options without significant impact to the core
RapidIO code.
A new enumeration/discovery method has to be attached to one or more mport
devices before an enumeration/discovery process can be started. Normally,
method's module initialization routine calls rio_register_scan() to attach
an enumerator to a specified mport device (or devices). The basic enumerator
implementation demonstrates this process.
4.6 Using Loadable RapidIO Switch Drivers
-----------------------------------------
In the case when RapidIO switch drivers are built as loadable modules a user
must ensure that they are loaded before the enumeration/discovery starts.
This process can be automated by specifying pre- or post- dependencies in the
RapidIO-specific modprobe configuration file as shown in the example below.
File /etc/modprobe.d/rapidio.conf::
# Configure RapidIO subsystem modules
# Set enumerator host destination ID (overrides kernel command line option)
options rapidio hdid=-1,2
# Load RapidIO switch drivers immediately after rapidio core module was loaded
softdep rapidio post: idt_gen2 idtcps tsi57x
# OR :
# Load RapidIO switch drivers just before rio-scan enumerator module is loaded
softdep rio-scan pre: idt_gen2 idtcps tsi57x
--------------------------
NOTE:
In the example above, one of "softdep" commands must be removed or
commented out to keep required module loading sequence.
5. References
=============
[1] RapidIO Trade Association. RapidIO Interconnect Specifications.
http://www.rapidio.org.
[2] Rapidio TA. Technology Comparisons.
http://www.rapidio.org/education/technology_comparisons/
[3] RapidIO support for Linux.
http://lwn.net/Articles/139118/
[4] Matt Porter. RapidIO for Linux. Ottawa Linux Symposium, 2005
http://www.kernel.org/doc/ols/2005/ols2005v2-pages-43-56.pdf

View File

@@ -0,0 +1,135 @@
==========================================================================
RapidIO subsystem Channelized Messaging character device driver (rio_cm.c)
==========================================================================
1. Overview
===========
This device driver is the result of collaboration within the RapidIO.org
Software Task Group (STG) between Texas Instruments, Prodrive Technologies,
Nokia Networks, BAE and IDT. Additional input was received from other members
of RapidIO.org.
The objective was to create a character mode driver interface which exposes
messaging capabilities of RapidIO endpoint devices (mports) directly
to applications, in a manner that allows the numerous and varied RapidIO
implementations to interoperate.
This driver (RIO_CM) provides to user-space applications shared access to
RapidIO mailbox messaging resources.
RapidIO specification (Part 2) defines that endpoint devices may have up to four
messaging mailboxes in case of multi-packet message (up to 4KB) and
up to 64 mailboxes if single-packet messages (up to 256 B) are used. In addition
to protocol definition limitations, a particular hardware implementation can
have reduced number of messaging mailboxes. RapidIO aware applications must
therefore share the messaging resources of a RapidIO endpoint.
Main purpose of this device driver is to provide RapidIO mailbox messaging
capability to large number of user-space processes by introducing socket-like
operations using a single messaging mailbox. This allows applications to
use the limited RapidIO messaging hardware resources efficiently.
Most of device driver's operations are supported through 'ioctl' system calls.
When loaded this device driver creates a single file system node named rio_cm
in /dev directory common for all registered RapidIO mport devices.
Following ioctl commands are available to user-space applications:
- RIO_CM_MPORT_GET_LIST:
Returns to caller list of local mport devices that
support messaging operations (number of entries up to RIO_MAX_MPORTS).
Each list entry is combination of mport's index in the system and RapidIO
destination ID assigned to the port.
- RIO_CM_EP_GET_LIST_SIZE:
Returns number of messaging capable remote endpoints
in a RapidIO network associated with the specified mport device.
- RIO_CM_EP_GET_LIST:
Returns list of RapidIO destination IDs for messaging
capable remote endpoints (peers) available in a RapidIO network associated
with the specified mport device.
- RIO_CM_CHAN_CREATE:
Creates RapidIO message exchange channel data structure
with channel ID assigned automatically or as requested by a caller.
- RIO_CM_CHAN_BIND:
Binds the specified channel data structure to the specified
mport device.
- RIO_CM_CHAN_LISTEN:
Enables listening for connection requests on the specified
channel.
- RIO_CM_CHAN_ACCEPT:
Accepts a connection request from peer on the specified
channel. If wait timeout for this request is specified by a caller it is
a blocking call. If timeout set to 0 this is non-blocking call - ioctl
handler checks for a pending connection request and if one is not available
exits with -EGAIN error status immediately.
- RIO_CM_CHAN_CONNECT:
Sends a connection request to a remote peer/channel.
- RIO_CM_CHAN_SEND:
Sends a data message through the specified channel.
The handler for this request assumes that message buffer specified by
a caller includes the reserved space for a packet header required by
this driver.
- RIO_CM_CHAN_RECEIVE:
Receives a data message through a connected channel.
If the channel does not have an incoming message ready to return this ioctl
handler will wait for new message until timeout specified by a caller
expires. If timeout value is set to 0, ioctl handler uses a default value
defined by MAX_SCHEDULE_TIMEOUT.
- RIO_CM_CHAN_CLOSE:
Closes a specified channel and frees associated buffers.
If the specified channel is in the CONNECTED state, sends close notification
to the remote peer.
The ioctl command codes and corresponding data structures intended for use by
user-space applications are defined in 'include/uapi/linux/rio_cm_cdev.h'.
2. Hardware Compatibility
=========================
This device driver uses standard interfaces defined by kernel RapidIO subsystem
and therefore it can be used with any mport device driver registered by RapidIO
subsystem with limitations set by available mport HW implementation of messaging
mailboxes.
3. Module parameters
====================
- 'dbg_level'
- This parameter allows to control amount of debug information
generated by this device driver. This parameter is formed by set of
bit masks that correspond to the specific functional block.
For mask definitions see 'drivers/rapidio/devices/rio_cm.c'
This parameter can be changed dynamically.
Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level.
- 'cmbox'
- Number of RapidIO mailbox to use (default value is 1).
This parameter allows to set messaging mailbox number that will be used
within entire RapidIO network. It can be used when default mailbox is
used by other device drivers or is not supported by some nodes in the
RapidIO network.
- 'chstart'
- Start channel number for dynamic assignment. Default value - 256.
Allows to exclude channel numbers below this parameter from dynamic
allocation to avoid conflicts with software components that use
reserved predefined channel numbers.
4. Known problems
=================
None.
5. User-space Applications and API Library
==========================================
Messaging API library and applications that use this device driver are available
from RapidIO.org.
6. TODO List
============
- Add support for system notification messages (reserved channel 0).

View File

@@ -0,0 +1,7 @@
=============
Sysfs entries
=============
The RapidIO sysfs files have moved to:
Documentation/ABI/testing/sysfs-bus-rapidio and
Documentation/ABI/testing/sysfs-class-rapidio

View File

@@ -0,0 +1,112 @@
=========================================================================
RapidIO subsystem mport driver for IDT Tsi721 PCI Express-to-SRIO bridge.
=========================================================================
1. Overview
===========
This driver implements all currently defined RapidIO mport callback functions.
It supports maintenance read and write operations, inbound and outbound RapidIO
doorbells, inbound maintenance port-writes and RapidIO messaging.
To generate SRIO maintenance transactions this driver uses one of Tsi721 DMA
channels. This mechanism provides access to larger range of hop counts and
destination IDs without need for changes in outbound window translation.
RapidIO messaging support uses dedicated messaging channels for each mailbox.
For inbound messages this driver uses destination ID matching to forward messages
into the corresponding message queue. Messaging callbacks are implemented to be
fully compatible with RIONET driver (Ethernet over RapidIO messaging services).
1. Module parameters:
- 'dbg_level'
- This parameter allows to control amount of debug information
generated by this device driver. This parameter is formed by set of
This parameter can be changed bit masks that correspond to the specific
functional block.
For mask definitions see 'drivers/rapidio/devices/tsi721.h'
This parameter can be changed dynamically.
Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level.
- 'dma_desc_per_channel'
- This parameter defines number of hardware buffer
descriptors allocated for each registered Tsi721 DMA channel.
Its default value is 128.
- 'dma_txqueue_sz'
- DMA transactions queue size. Defines number of pending
transaction requests that can be accepted by each DMA channel.
Default value is 16.
- 'dma_sel'
- DMA channel selection mask. Bitmask that defines which hardware
DMA channels (0 ... 6) will be registered with DmaEngine core.
If bit is set to 1, the corresponding DMA channel will be registered.
DMA channels not selected by this mask will not be used by this device
driver. Default value is 0x7f (use all channels).
- 'pcie_mrrs'
- override value for PCIe Maximum Read Request Size (MRRS).
This parameter gives an ability to override MRRS value set during PCIe
configuration process. Tsi721 supports read request sizes up to 4096B.
Value for this parameter must be set as defined by PCIe specification:
0 = 128B, 1 = 256B, 2 = 512B, 3 = 1024B, 4 = 2048B and 5 = 4096B.
Default value is '-1' (= keep platform setting).
- 'mbox_sel'
- RIO messaging MBOX selection mask. This is a bitmask that defines
messaging MBOXes are managed by this device driver. Mask bits 0 - 3
correspond to MBOX0 - MBOX3. MBOX is under driver's control if the
corresponding bit is set to '1'. Default value is 0x0f (= all).
2. Known problems
=================
None.
3. DMA Engine Support
=====================
Tsi721 mport driver supports DMA data transfers between local system memory and
remote RapidIO devices. This functionality is implemented according to SLAVE
mode API defined by common Linux kernel DMA Engine framework.
Depending on system requirements RapidIO DMA operations can be included/excluded
by setting CONFIG_RAPIDIO_DMA_ENGINE option. Tsi721 miniport driver uses seven
out of eight available BDMA channels to support DMA data transfers.
One BDMA channel is reserved for generation of maintenance read/write requests.
If Tsi721 mport driver have been built with RAPIDIO_DMA_ENGINE support included,
this driver will accept DMA-specific module parameter:
"dma_desc_per_channel"
- defines number of hardware buffer descriptors used by
each BDMA channel of Tsi721 (by default - 128).
4. Version History
===== ====================================================================
1.1.0 DMA operations re-worked to support data scatter/gather lists larger
than hardware buffer descriptors ring.
1.0.0 Initial driver release.
===== ====================================================================
5. License
===========
Copyright(c) 2011 Integrated Device Technology, Inc. All rights reserved.
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

View File

@@ -0,0 +1,132 @@
===============================
rfkill - RF kill switch support
===============================
.. contents::
:depth: 2
Introduction
============
The rfkill subsystem provides a generic interface for disabling any radio
transmitter in the system. When a transmitter is blocked, it shall not
radiate any power.
The subsystem also provides the ability to react on button presses and
disable all transmitters of a certain type (or all). This is intended for
situations where transmitters need to be turned off, for example on
aircraft.
The rfkill subsystem has a concept of "hard" and "soft" block, which
differ little in their meaning (block == transmitters off) but rather in
whether they can be changed or not:
- hard block
read-only radio block that cannot be overridden by software
- soft block
writable radio block (need not be readable) that is set by
the system software.
The rfkill subsystem has two parameters, rfkill.default_state and
rfkill.master_switch_mode, which are documented in
admin-guide/kernel-parameters.rst.
Implementation details
======================
The rfkill subsystem is composed of three main components:
* the rfkill core,
* the deprecated rfkill-input module (an input layer handler, being
replaced by userspace policy code) and
* the rfkill drivers.
The rfkill core provides API for kernel drivers to register their radio
transmitter with the kernel, methods for turning it on and off, and letting
the system know about hardware-disabled states that may be implemented on
the device.
The rfkill core code also notifies userspace of state changes, and provides
ways for userspace to query the current states. See the "Userspace support"
section below.
When the device is hard-blocked (either by a call to rfkill_set_hw_state()
or from query_hw_block), set_block() will be invoked for additional software
block, but drivers can ignore the method call since they can use the return
value of the function rfkill_set_hw_state() to sync the software state
instead of keeping track of calls to set_block(). In fact, drivers should
use the return value of rfkill_set_hw_state() unless the hardware actually
keeps track of soft and hard block separately.
Kernel API
==========
Drivers for radio transmitters normally implement an rfkill driver.
Platform drivers might implement input devices if the rfkill button is just
that, a button. If that button influences the hardware then you need to
implement an rfkill driver instead. This also applies if the platform provides
a way to turn on/off the transmitter(s).
For some platforms, it is possible that the hardware state changes during
suspend/hibernation, in which case it will be necessary to update the rfkill
core with the current state at resume time.
To create an rfkill driver, driver's Kconfig needs to have::
depends on RFKILL || !RFKILL
to ensure the driver cannot be built-in when rfkill is modular. The !RFKILL
case allows the driver to be built when rfkill is not configured, in which
case all rfkill API can still be used but will be provided by static inlines
which compile to almost nothing.
Calling rfkill_set_hw_state() when a state change happens is required from
rfkill drivers that control devices that can be hard-blocked unless they also
assign the poll_hw_block() callback (then the rfkill core will poll the
device). Don't do this unless you cannot get the event in any other way.
rfkill provides per-switch LED triggers, which can be used to drive LEDs
according to the switch state (LED_FULL when blocked, LED_OFF otherwise).
Userspace support
=================
The recommended userspace interface to use is /dev/rfkill, which is a misc
character device that allows userspace to obtain and set the state of rfkill
devices and sets of devices. It also notifies userspace about device addition
and removal. The API is a simple read/write API that is defined in
linux/rfkill.h, with one ioctl that allows turning off the deprecated input
handler in the kernel for the transition period.
Except for the one ioctl, communication with the kernel is done via read()
and write() of instances of 'struct rfkill_event'. In this structure, the
soft and hard block are properly separated (unlike sysfs, see below) and
userspace is able to get a consistent snapshot of all rfkill devices in the
system. Also, it is possible to switch all rfkill drivers (or all drivers of
a specified type) into a state which also updates the default state for
hotplugged devices.
After an application opens /dev/rfkill, it can read the current state of all
devices. Changes can be obtained by either polling the descriptor for
hotplug or state change events or by listening for uevents emitted by the
rfkill core framework.
Additionally, each rfkill device is registered in sysfs and emits uevents.
rfkill devices issue uevents (with an action of "change"), with the following
environment variables set::
RFKILL_NAME
RFKILL_STATE
RFKILL_TYPE
The content of these variables corresponds to the "name", "state" and
"type" sysfs files explained above.
For further details consult Documentation/ABI/stable/sysfs-class-rfkill.

View File

@@ -0,0 +1,11 @@
================
Cyclades-Z notes
================
The Cyclades-Z must have firmware loaded onto the card before it will
operate. This operation should be performed during system startup,
The firmware, loader program and the latest device driver code are
available from Cyclades at
ftp://ftp.cyclades.com/pub/cyclades/cyclades-z/linux/

View File

@@ -0,0 +1,549 @@
====================
Low Level Serial API
====================
This document is meant as a brief overview of some aspects of the new serial
driver. It is not complete, any questions you have should be directed to
<rmk@arm.linux.org.uk>
The reference implementation is contained within amba-pl011.c.
Low Level Serial Hardware Driver
--------------------------------
The low level serial hardware driver is responsible for supplying port
information (defined by uart_port) and a set of control methods (defined
by uart_ops) to the core serial driver. The low level driver is also
responsible for handling interrupts for the port, and providing any
console support.
Console Support
---------------
The serial core provides a few helper functions. This includes identifing
the correct port structure (via uart_get_console) and decoding command line
arguments (uart_parse_options).
There is also a helper function (uart_console_write) which performs a
character by character write, translating newlines to CRLF sequences.
Driver writers are recommended to use this function rather than implementing
their own version.
Locking
-------
It is the responsibility of the low level hardware driver to perform the
necessary locking using port->lock. There are some exceptions (which
are described in the uart_ops listing below.)
There are two locks. A per-port spinlock, and an overall semaphore.
From the core driver perspective, the port->lock locks the following
data::
port->mctrl
port->icount
port->state->xmit.head (circ_buf->head)
port->state->xmit.tail (circ_buf->tail)
The low level driver is free to use this lock to provide any additional
locking.
The port_sem semaphore is used to protect against ports being added/
removed or reconfigured at inappropriate times. Since v2.6.27, this
semaphore has been the 'mutex' member of the tty_port struct, and
commonly referred to as the port mutex.
uart_ops
--------
The uart_ops structure is the main interface between serial_core and the
hardware specific driver. It contains all the methods to control the
hardware.
tx_empty(port)
This function tests whether the transmitter fifo and shifter
for the port described by 'port' is empty. If it is empty,
this function should return TIOCSER_TEMT, otherwise return 0.
If the port does not support this operation, then it should
return TIOCSER_TEMT.
Locking: none.
Interrupts: caller dependent.
This call must not sleep
set_mctrl(port, mctrl)
This function sets the modem control lines for port described
by 'port' to the state described by mctrl. The relevant bits
of mctrl are:
- TIOCM_RTS RTS signal.
- TIOCM_DTR DTR signal.
- TIOCM_OUT1 OUT1 signal.
- TIOCM_OUT2 OUT2 signal.
- TIOCM_LOOP Set the port into loopback mode.
If the appropriate bit is set, the signal should be driven
active. If the bit is clear, the signal should be driven
inactive.
Locking: port->lock taken.
Interrupts: locally disabled.
This call must not sleep
get_mctrl(port)
Returns the current state of modem control inputs. The state
of the outputs should not be returned, since the core keeps
track of their state. The state information should include:
- TIOCM_CAR state of DCD signal
- TIOCM_CTS state of CTS signal
- TIOCM_DSR state of DSR signal
- TIOCM_RI state of RI signal
The bit is set if the signal is currently driven active. If
the port does not support CTS, DCD or DSR, the driver should
indicate that the signal is permanently active. If RI is
not available, the signal should not be indicated as active.
Locking: port->lock taken.
Interrupts: locally disabled.
This call must not sleep
stop_tx(port)
Stop transmitting characters. This might be due to the CTS
line becoming inactive or the tty layer indicating we want
to stop transmission due to an XOFF character.
The driver should stop transmitting characters as soon as
possible.
Locking: port->lock taken.
Interrupts: locally disabled.
This call must not sleep
start_tx(port)
Start transmitting characters.
Locking: port->lock taken.
Interrupts: locally disabled.
This call must not sleep
throttle(port)
Notify the serial driver that input buffers for the line discipline are
close to full, and it should somehow signal that no more characters
should be sent to the serial port.
This will be called only if hardware assisted flow control is enabled.
Locking: serialized with .unthrottle() and termios modification by the
tty layer.
unthrottle(port)
Notify the serial driver that characters can now be sent to the serial
port without fear of overrunning the input buffers of the line
disciplines.
This will be called only if hardware assisted flow control is enabled.
Locking: serialized with .throttle() and termios modification by the
tty layer.
send_xchar(port,ch)
Transmit a high priority character, even if the port is stopped.
This is used to implement XON/XOFF flow control and tcflow(). If
the serial driver does not implement this function, the tty core
will append the character to the circular buffer and then call
start_tx() / stop_tx() to flush the data out.
Do not transmit if ch == '\0' (__DISABLED_CHAR).
Locking: none.
Interrupts: caller dependent.
stop_rx(port)
Stop receiving characters; the port is in the process of
being closed.
Locking: port->lock taken.
Interrupts: locally disabled.
This call must not sleep
enable_ms(port)
Enable the modem status interrupts.
This method may be called multiple times. Modem status
interrupts should be disabled when the shutdown method is
called.
Locking: port->lock taken.
Interrupts: locally disabled.
This call must not sleep
break_ctl(port,ctl)
Control the transmission of a break signal. If ctl is
nonzero, the break signal should be transmitted. The signal
should be terminated when another call is made with a zero
ctl.
Locking: caller holds tty_port->mutex
startup(port)
Grab any interrupt resources and initialise any low level driver
state. Enable the port for reception. It should not activate
RTS nor DTR; this will be done via a separate call to set_mctrl.
This method will only be called when the port is initially opened.
Locking: port_sem taken.
Interrupts: globally disabled.
shutdown(port)
Disable the port, disable any break condition that may be in
effect, and free any interrupt resources. It should not disable
RTS nor DTR; this will have already been done via a separate
call to set_mctrl.
Drivers must not access port->state once this call has completed.
This method will only be called when there are no more users of
this port.
Locking: port_sem taken.
Interrupts: caller dependent.
flush_buffer(port)
Flush any write buffers, reset any DMA state and stop any
ongoing DMA transfers.
This will be called whenever the port->state->xmit circular
buffer is cleared.
Locking: port->lock taken.
Interrupts: locally disabled.
This call must not sleep
set_termios(port,termios,oldtermios)
Change the port parameters, including word length, parity, stop
bits. Update read_status_mask and ignore_status_mask to indicate
the types of events we are interested in receiving. Relevant
termios->c_cflag bits are:
CSIZE
- word size
CSTOPB
- 2 stop bits
PARENB
- parity enable
PARODD
- odd parity (when PARENB is in force)
CREAD
- enable reception of characters (if not set,
still receive characters from the port, but
throw them away.
CRTSCTS
- if set, enable CTS status change reporting
CLOCAL
- if not set, enable modem status change
reporting.
Relevant termios->c_iflag bits are:
INPCK
- enable frame and parity error events to be
passed to the TTY layer.
BRKINT / PARMRK
- both of these enable break events to be
passed to the TTY layer.
IGNPAR
- ignore parity and framing errors
IGNBRK
- ignore break errors, If IGNPAR is also
set, ignore overrun errors as well.
The interaction of the iflag bits is as follows (parity error
given as an example):
=============== ======= ====== =============================
Parity error INPCK IGNPAR
=============== ======= ====== =============================
n/a 0 n/a character received, marked as
TTY_NORMAL
None 1 n/a character received, marked as
TTY_NORMAL
Yes 1 0 character received, marked as
TTY_PARITY
Yes 1 1 character discarded
=============== ======= ====== =============================
Other flags may be used (eg, xon/xoff characters) if your
hardware supports hardware "soft" flow control.
Locking: caller holds tty_port->mutex
Interrupts: caller dependent.
This call must not sleep
set_ldisc(port,termios)
Notifier for discipline change. See Documentation/driver-api/serial/tty.rst.
Locking: caller holds tty_port->mutex
pm(port,state,oldstate)
Perform any power management related activities on the specified
port. State indicates the new state (defined by
enum uart_pm_state), oldstate indicates the previous state.
This function should not be used to grab any resources.
This will be called when the port is initially opened and finally
closed, except when the port is also the system console. This
will occur even if CONFIG_PM is not set.
Locking: none.
Interrupts: caller dependent.
type(port)
Return a pointer to a string constant describing the specified
port, or return NULL, in which case the string 'unknown' is
substituted.
Locking: none.
Interrupts: caller dependent.
release_port(port)
Release any memory and IO region resources currently in use by
the port.
Locking: none.
Interrupts: caller dependent.
request_port(port)
Request any memory and IO region resources required by the port.
If any fail, no resources should be registered when this function
returns, and it should return -EBUSY on failure.
Locking: none.
Interrupts: caller dependent.
config_port(port,type)
Perform any autoconfiguration steps required for the port. `type`
contains a bit mask of the required configuration. UART_CONFIG_TYPE
indicates that the port requires detection and identification.
port->type should be set to the type found, or PORT_UNKNOWN if
no port was detected.
UART_CONFIG_IRQ indicates autoconfiguration of the interrupt signal,
which should be probed using standard kernel autoprobing techniques.
This is not necessary on platforms where ports have interrupts
internally hard wired (eg, system on a chip implementations).
Locking: none.
Interrupts: caller dependent.
verify_port(port,serinfo)
Verify the new serial port information contained within serinfo is
suitable for this port type.
Locking: none.
Interrupts: caller dependent.
ioctl(port,cmd,arg)
Perform any port specific IOCTLs. IOCTL commands must be defined
using the standard numbering system found in <asm/ioctl.h>
Locking: none.
Interrupts: caller dependent.
poll_init(port)
Called by kgdb to perform the minimal hardware initialization needed
to support poll_put_char() and poll_get_char(). Unlike ->startup()
this should not request interrupts.
Locking: tty_mutex and tty_port->mutex taken.
Interrupts: n/a.
poll_put_char(port,ch)
Called by kgdb to write a single character directly to the serial
port. It can and should block until there is space in the TX FIFO.
Locking: none.
Interrupts: caller dependent.
This call must not sleep
poll_get_char(port)
Called by kgdb to read a single character directly from the serial
port. If data is available, it should be returned; otherwise
the function should return NO_POLL_CHAR immediately.
Locking: none.
Interrupts: caller dependent.
This call must not sleep
Other functions
---------------
uart_update_timeout(port,cflag,baud)
Update the FIFO drain timeout, port->timeout, according to the
number of bits, parity, stop bits and baud rate.
Locking: caller is expected to take port->lock
Interrupts: n/a
uart_get_baud_rate(port,termios,old,min,max)
Return the numeric baud rate for the specified termios, taking
account of the special 38400 baud "kludge". The B0 baud rate
is mapped to 9600 baud.
If the baud rate is not within min..max, then if old is non-NULL,
the original baud rate will be tried. If that exceeds the
min..max constraint, 9600 baud will be returned. termios will
be updated to the baud rate in use.
Note: min..max must always allow 9600 baud to be selected.
Locking: caller dependent.
Interrupts: n/a
uart_get_divisor(port,baud)
Return the divisor (baud_base / baud) for the specified baud
rate, appropriately rounded.
If 38400 baud and custom divisor is selected, return the
custom divisor instead.
Locking: caller dependent.
Interrupts: n/a
uart_match_port(port1,port2)
This utility function can be used to determine whether two
uart_port structures describe the same port.
Locking: n/a
Interrupts: n/a
uart_write_wakeup(port)
A driver is expected to call this function when the number of
characters in the transmit buffer have dropped below a threshold.
Locking: port->lock should be held.
Interrupts: n/a
uart_register_driver(drv)
Register a uart driver with the core driver. We in turn register
with the tty layer, and initialise the core driver per-port state.
drv->port should be NULL, and the per-port structures should be
registered using uart_add_one_port after this call has succeeded.
Locking: none
Interrupts: enabled
uart_unregister_driver()
Remove all references to a driver from the core driver. The low
level driver must have removed all its ports via the
uart_remove_one_port() if it registered them with uart_add_one_port().
Locking: none
Interrupts: enabled
**uart_suspend_port()**
**uart_resume_port()**
**uart_add_one_port()**
**uart_remove_one_port()**
Other notes
-----------
It is intended some day to drop the 'unused' entries from uart_port, and
allow low level drivers to register their own individual uart_port's with
the core. This will allow drivers to use uart_port as a pointer to a
structure containing both the uart_port entry with their own extensions,
thus::
struct my_port {
struct uart_port port;
int my_stuff;
};
Modem control lines via GPIO
----------------------------
Some helpers are provided in order to set/get modem control lines via GPIO.
mctrl_gpio_init(port, idx):
This will get the {cts,rts,...}-gpios from device tree if they are
present and request them, set direction etc, and return an
allocated structure. `devm_*` functions are used, so there's no need
to call mctrl_gpio_free().
As this sets up the irq handling make sure to not handle changes to the
gpio input lines in your driver, too.
mctrl_gpio_free(dev, gpios):
This will free the requested gpios in mctrl_gpio_init().
As `devm_*` functions are used, there's generally no need to call
this function.
mctrl_gpio_to_gpiod(gpios, gidx)
This returns the gpio_desc structure associated to the modem line
index.
mctrl_gpio_set(gpios, mctrl):
This will sets the gpios according to the mctrl state.
mctrl_gpio_get(gpios, mctrl):
This will update mctrl with the gpios values.
mctrl_gpio_enable_ms(gpios):
Enables irqs and handling of changes to the ms lines.
mctrl_gpio_disable_ms(gpios):
Disables irqs and handling of changes to the ms lines.

View File

@@ -0,0 +1,32 @@
.. SPDX-License-Identifier: GPL-2.0
==========================
Support for Serial devices
==========================
.. toctree::
:maxdepth: 1
driver
tty
Serial drivers
==============
.. toctree::
:maxdepth: 1
cyclades_z
moxa-smartio
n_gsm
rocket
serial-iso7816
serial-rs485
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

View File

@@ -0,0 +1,615 @@
=============================================================
MOXA Smartio/Industio Family Device Driver Installation Guide
=============================================================
.. note::
This file is outdated. It needs some care in order to make it
updated to Kernel 5.0 and upper
Copyright (C) 2008, Moxa Inc.
Date: 01/21/2008
.. Content
1. Introduction
2. System Requirement
3. Installation
3.1 Hardware installation
3.2 Driver files
3.3 Device naming convention
3.4 Module driver configuration
3.5 Static driver configuration for Linux kernel 2.4.x and 2.6.x.
3.6 Custom configuration
3.7 Verify driver installation
4. Utilities
5. Setserial
6. Troubleshooting
1. Introduction
^^^^^^^^^^^^^^^
The Smartio/Industio/UPCI family Linux driver supports following multiport
boards.
- 2 ports multiport board
CP-102U, CP-102UL, CP-102UF
CP-132U-I, CP-132UL,
CP-132, CP-132I, CP132S, CP-132IS,
CI-132, CI-132I, CI-132IS,
(C102H, C102HI, C102HIS, C102P, CP-102, CP-102S)
- 4 ports multiport board
CP-104EL,
CP-104UL, CP-104JU,
CP-134U, CP-134U-I,
C104H/PCI, C104HS/PCI,
CP-114, CP-114I, CP-114S, CP-114IS, CP-114UL,
C104H, C104HS,
CI-104J, CI-104JS,
CI-134, CI-134I, CI-134IS,
(C114HI, CT-114I, C104P),
POS-104UL,
CB-114,
CB-134I
- 8 ports multiport board
CP-118EL, CP-168EL,
CP-118U, CP-168U,
C168H/PCI,
C168H, C168HS,
(C168P),
CB-108
This driver and installation procedure have been developed upon Linux Kernel
2.4.x and 2.6.x. This driver supports Intel x86 hardware platform. In order
to maintain compatibility, this version has also been properly tested with
RedHat, Mandrake, Fedora and S.u.S.E Linux. However, if compatibility problem
occurs, please contact Moxa at support@moxa.com.tw.
In addition to device driver, useful utilities are also provided in this
version. They are:
- msdiag
Diagnostic program for displaying installed Moxa
Smartio/Industio boards.
- msmon
Monitor program to observe data count and line status signals.
- msterm A simple terminal program which is useful in testing serial
ports.
- io-irq.exe
Configuration program to setup ISA boards. Please note that
this program can only be executed under DOS.
All the drivers and utilities are published in form of source code under
GNU General Public License in this version. Please refer to GNU General
Public License announcement in each source code file for more detail.
In Moxa's Web sites, you may always find latest driver at http://www.moxa.com/.
This version of driver can be installed as Loadable Module (Module driver)
or built-in into kernel (Static driver). You may refer to following
installation procedure for suitable one. Before you install the driver,
please refer to hardware installation procedure in the User's Manual.
We assume the user should be familiar with following documents.
- Serial-HOWTO
- Kernel-HOWTO
2. System Requirement
^^^^^^^^^^^^^^^^^^^^^
- Hardware platform: Intel x86 machine
- Kernel version: 2.4.x or 2.6.x
- gcc version 2.72 or later
- Maximum 4 boards can be installed in combination
3. Installation
^^^^^^^^^^^^^^^
3.1 Hardware installation
=========================
There are two types of buses, ISA and PCI, for Smartio/Industio
family multiport board.
ISA board
---------
You'll have to configure CAP address, I/O address, Interrupt Vector
as well as IRQ before installing this driver. Please refer to hardware
installation procedure in User's Manual before proceed any further.
Please make sure the JP1 is open after the ISA board is set properly.
PCI/UPCI board
--------------
You may need to adjust IRQ usage in BIOS to avoid from IRQ conflict
with other ISA devices. Please refer to hardware installation
procedure in User's Manual in advance.
PCI IRQ Sharing
---------------
Each port within the same multiport board shares the same IRQ. Up to
4 Moxa Smartio/Industio PCI Family multiport boards can be installed
together on one system and they can share the same IRQ.
3.2 Driver files
================
The driver file may be obtained from ftp, CD-ROM or floppy disk. The
first step, anyway, is to copy driver file "mxser.tgz" into specified
directory. e.g. /moxa. The execute commands as below::
# cd /
# mkdir moxa
# cd /moxa
# tar xvf /dev/fd0
or::
# cd /
# mkdir moxa
# cd /moxa
# cp /mnt/cdrom/<driver directory>/mxser.tgz .
# tar xvfz mxser.tgz
3.3 Device naming convention
============================
You may find all the driver and utilities files in /moxa/mxser.
Following installation procedure depends on the model you'd like to
run the driver. If you prefer module driver, please refer to 3.4.
If static driver is required, please refer to 3.5.
Dialin and callout port
-----------------------
This driver remains traditional serial device properties. There are
two special file name for each serial port. One is dial-in port
which is named "ttyMxx". For callout port, the naming convention
is "cumxx".
Device naming when more than 2 boards installed
-----------------------------------------------
Naming convention for each Smartio/Industio multiport board is
pre-defined as below.
============ =============== ==============
Board Num. Dial-in Port Callout port
1st board ttyM0 - ttyM7 cum0 - cum7
2nd board ttyM8 - ttyM15 cum8 - cum15
3rd board ttyM16 - ttyM23 cum16 - cum23
4th board ttyM24 - ttym31 cum24 - cum31
============ =============== ==============
.. note::
Under Kernel 2.6 and upper, the cum Device is Obsolete. So use ttyM*
device instead.
Board sequence
--------------
This driver will activate ISA boards according to the parameter set
in the driver. After all specified ISA board activated, PCI board
will be installed in the system automatically driven.
Therefore the board number is sorted by the CAP address of ISA boards.
For PCI boards, their sequence will be after ISA boards and C168H/PCI
has higher priority than C104H/PCI boards.
3.4 Module driver configuration
===============================
Module driver is easiest way to install. If you prefer static driver
installation, please skip this paragraph.
------------- Prepare to use the MOXA driver --------------------
3.4.1 Create tty device with correct major number
-------------------------------------------------
Before using MOXA driver, your system must have the tty devices
which are created with driver's major number. We offer one shell
script "msmknod" to simplify the procedure.
This step is only needed to be executed once. But you still
need to do this procedure when:
a. You change the driver's major number. Please refer the "3.7"
section.
b. Your total installed MOXA boards number is changed. Maybe you
add/delete one MOXA board.
c. You want to change the tty name. This needs to modify the
shell script "msmknod"
The procedure is::
# cd /moxa/mxser/driver
# ./msmknod
This shell script will require the major number for dial-in
device and callout device to create tty device. You also need
to specify the total installed MOXA board number. Default major
numbers for dial-in device and callout device are 30, 35. If
you need to change to other number, please refer section "3.7"
for more detailed procedure.
Msmknod will delete any special files occupying the same device
naming.
3.4.2 Build the MOXA driver and utilities
-----------------------------------------
Before using the MOXA driver and utilities, you need compile the
all the source code. This step is only need to be executed once.
But you still re-compile the source code if you modify the source
code. For example, if you change the driver's major number (see
"3.7" section), then you need to do this step again.
Find "Makefile" in /moxa/mxser, then run
# make clean; make install
..note::
For Red Hat 9, Red Hat Enterprise Linux AS3/ES3/WS3 & Fedora Core1:
# make clean; make installsp1
For Red Hat Enterprise Linux AS4/ES4/WS4:
# make clean; make installsp2
The driver files "mxser.o" and utilities will be properly compiled
and copied to system directories respectively.
------------- Load MOXA driver--------------------
3.4.3 Load the MOXA driver
--------------------------
::
# modprobe mxser <argument>
will activate the module driver. You may run "lsmod" to check
if "mxser" is activated. If the MOXA board is ISA board, the
<argument> is needed. Please refer to section "3.4.5" for more
information.
------------- Load MOXA driver on boot --------------------
3.4.4 Load the mxser driver
---------------------------
For the above description, you may manually execute
"modprobe mxser" to activate this driver and run
"rmmod mxser" to remove it.
However, it's better to have a boot time configuration to
eliminate manual operation. Boot time configuration can be
achieved by rc file. We offer one "rc.mxser" file to simplify
the procedure under "moxa/mxser/driver".
But if you use ISA board, please modify the "modprobe ..." command
to add the argument (see "3.4.5" section). After modifying the
rc.mxser, please try to execute "/moxa/mxser/driver/rc.mxser"
manually to make sure the modification is ok. If any error
encountered, please try to modify again. If the modification is
completed, follow the below step.
Run following command for setting rc files::
# cd /moxa/mxser/driver
# cp ./rc.mxser /etc/rc.d
# cd /etc/rc.d
Check "rc.serial" is existed or not. If "rc.serial" doesn't exist,
create it by vi, run "chmod 755 rc.serial" to change the permission.
Add "/etc/rc.d/rc.mxser" in last line.
Reboot and check if moxa.o activated by "lsmod" command.
3.4.5. specify CAP address
--------------------------
If you'd like to drive Smartio/Industio ISA boards in the system,
you'll have to add parameter to specify CAP address of given
board while activating "mxser.o". The format for parameters are
as follows.::
modprobe mxser ioaddr=0x???,0x???,0x???,0x???
| | | |
| | | +- 4th ISA board
| | +------ 3rd ISA board
| +------------ 2nd ISA board
+-------------------1st ISA board
3.5 Static driver configuration for Linux kernel 2.4.x and 2.6.x
================================================================
Note:
To use static driver, you must install the linux kernel
source package.
3.5.1 Backup the built-in driver in the kernel
----------------------------------------------
::
# cd /usr/src/linux/drivers/char
# mv mxser.c mxser.c.old
For Red Hat 7.x user, you need to create link:
# cd /usr/src
# ln -s linux-2.4 linux
3.5.2 Create link
-----------------
::
# cd /usr/src/linux/drivers/char
# ln -s /moxa/mxser/driver/mxser.c mxser.c
3.5.3 Add CAP address list for ISA boards.
------------------------------------------
For PCI boards user, please skip this step.
In module mode, the CAP address for ISA board is given by
parameter. In static driver configuration, you'll have to
assign it within driver's source code. If you will not
install any ISA boards, you may skip to next portion.
The instructions to modify driver source code are as
below.
a. run::
# cd /moxa/mxser/driver
# vi mxser.c
b. Find the array mxserBoardCAP[] as below::
static int mxserBoardCAP[] = {0x00, 0x00, 0x00, 0x00};
c. Change the address within this array using vi. For
example, to driver 2 ISA boards with CAP address
0x280 and 0x180 as 1st and 2nd board. Just to change
the source code as follows::
static int mxserBoardCAP[] = {0x280, 0x180, 0x00, 0x00};
3.5.4 Setup kernel configuration
--------------------------------
Configure the kernel::
# cd /usr/src/linux
# make menuconfig
You will go into a menu-driven system. Please select [Character
devices][Non-standard serial port support], enable the [Moxa
SmartIO support] driver with "[*]" for built-in (not "[M]"), then
select [Exit] to exit this program.
3.5.5 Rebuild kernel
--------------------
The following are for Linux kernel rebuilding, for your
reference only.
For appropriate details, please refer to the Linux document:
a. Run the following commands::
cd /usr/src/linux
make clean # take a few minutes
make dep # take a few minutes
make bzImage # take probably 10-20 minutes
make install # copy boot image to correct position
f. Please make sure the boot kernel (vmlinuz) is in the
correct position.
g. If you use 'lilo' utility, you should check /etc/lilo.conf
'image' item specified the path which is the 'vmlinuz' path,
or you will load wrong (or old) boot kernel image (vmlinuz).
After checking /etc/lilo.conf, please run "lilo".
Note that if the result of "make bzImage" is ERROR, then you have to
go back to Linux configuration Setup. Type "make menuconfig" in
directory /usr/src/linux.
3.5.6 Make tty device and special file
--------------------------------------
::
# cd /moxa/mxser/driver
# ./msmknod
3.5.7 Make utility
------------------
::
# cd /moxa/mxser/utility
# make clean; make install
3.5.8 Reboot
------------
3.6 Custom configuration
========================
Although this driver already provides you default configuration, you
still can change the device name and major number. The instruction to
change these parameters are shown as below.
a. Change Device name
If you'd like to use other device names instead of default naming
convention, all you have to do is to modify the internal code
within the shell script "msmknod". First, you have to open "msmknod"
by vi. Locate each line contains "ttyM" and "cum" and change them
to the device name you desired. "msmknod" creates the device names
you need next time executed.
b. Change Major number
If major number 30 and 35 had been occupied, you may have to select
2 free major numbers for this driver. There are 3 steps to change
major numbers.
3.6.1 Find free major numbers
-----------------------------
In /proc/devices, you may find all the major numbers occupied
in the system. Please select 2 major numbers that are available.
e.g. 40, 45.
3.6.2 Create special files
--------------------------
Run /moxa/mxser/driver/msmknod to create special files with
specified major numbers.
3.6.3 Modify driver with new major number
-----------------------------------------
Run vi to open /moxa/mxser/driver/mxser.c. Locate the line
contains "MXSERMAJOR". Change the content as below::
#define MXSERMAJOR 40
#define MXSERCUMAJOR 45
3.6.4 Run "make clean; make install" in /moxa/mxser/driver.
3.7 Verify driver installation
==============================
You may refer to /var/log/messages to check the latest status
log reported by this driver whenever it's activated.
4. Utilities
^^^^^^^^^^^^
There are 3 utilities contained in this driver. They are msdiag, msmon and
msterm. These 3 utilities are released in form of source code. They should
be compiled into executable file and copied into /usr/bin.
Before using these utilities, please load driver (refer 3.4 & 3.5) and
make sure you had run the "msmknod" utility.
msdiag - Diagnostic
===================
This utility provides the function to display what Moxa Smartio/Industio
board found by driver in the system.
msmon - Port Monitoring
=======================
This utility gives the user a quick view about all the MOXA ports'
activities. One can easily learn each port's total received/transmitted
(Rx/Tx) character count since the time when the monitoring is started.
Rx/Tx throughputs per second are also reported in interval basis (e.g.
the last 5 seconds) and in average basis (since the time the monitoring
is started). You can reset all ports' count by <HOME> key. <+> <->
(plus/minus) keys to change the displaying time interval. Press <ENTER>
on the port, that cursor stay, to view the port's communication
parameters, signal status, and input/output queue.
msterm - Terminal Emulation
===========================
This utility provides data sending and receiving ability of all tty ports,
especially for MOXA ports. It is quite useful for testing simple
application, for example, sending AT command to a modem connected to the
port or used as a terminal for login purpose. Note that this is only a
dumb terminal emulation without handling full screen operation.
5. Setserial
^^^^^^^^^^^^
Supported Setserial parameters are listed as below.
============== =========================================================
uart set UART type(16450-->disable FIFO, 16550A-->enable FIFO)
close_delay set the amount of time(in 1/100 of a second) that DTR
should be kept low while being closed.
closing_wait set the amount of time(in 1/100 of a second) that the
serial port should wait for data to be drained while
being closed, before the receiver is disable.
spd_hi Use 57.6kb when the application requests 38.4kb.
spd_vhi Use 115.2kb when the application requests 38.4kb.
spd_shi Use 230.4kb when the application requests 38.4kb.
spd_warp Use 460.8kb when the application requests 38.4kb.
spd_normal Use 38.4kb when the application requests 38.4kb.
spd_cust Use the custom divisor to set the speed when the
application requests 38.4kb.
divisor This option set the custom division.
baud_base This option set the base baud rate.
============== =========================================================
6. Troubleshooting
^^^^^^^^^^^^^^^^^^
The boot time error messages and solutions are stated as clearly as
possible. If all the possible solutions fail, please contact our technical
support team to get more help.
Error msg:
More than 4 Moxa Smartio/Industio family boards found. Fifth board
and after are ignored.
Solution:
To avoid this problem, please unplug fifth and after board, because Moxa
driver supports up to 4 boards.
Error msg:
Request_irq fail, IRQ(?) may be conflict with another device.
Solution:
Other PCI or ISA devices occupy the assigned IRQ. If you are not sure
which device causes the situation, please check /proc/interrupts to find
free IRQ and simply change another free IRQ for Moxa board.
Error msg:
Board #: C1xx Series(CAP=xxx) interrupt number invalid.
Solution:
Each port within the same multiport board shares the same IRQ. Please set
one IRQ (IRQ doesn't equal to zero) for one Moxa board.
Error msg:
No interrupt vector be set for Moxa ISA board(CAP=xxx).
Solution:
Moxa ISA board needs an interrupt vector.Please refer to user's manual
"Hardware Installation" chapter to set interrupt vector.
Error msg:
Couldn't install MOXA Smartio/Industio family driver!
Solution:
Load Moxa driver fail, the major number may conflict with other devices.
Please refer to previous section 3.7 to change a free major number for
Moxa driver.
Error msg:
Couldn't install MOXA Smartio/Industio family callout driver!
Solution:
Load Moxa callout driver fail, the callout device major number may
conflict with other devices. Please refer to previous section 3.7 to
change a free callout device major number for Moxa driver.

View File

@@ -0,0 +1,103 @@
==============================
GSM 0710 tty multiplexor HOWTO
==============================
This line discipline implements the GSM 07.10 multiplexing protocol
detailed in the following 3GPP document:
http://www.3gpp.org/ftp/Specs/archive/07_series/07.10/0710-720.zip
This document give some hints on how to use this driver with GPRS and 3G
modems connected to a physical serial port.
How to use it
-------------
1. initialize the modem in 0710 mux mode (usually AT+CMUX= command) through
its serial port. Depending on the modem used, you can pass more or less
parameters to this command,
2. switch the serial line to using the n_gsm line discipline by using
TIOCSETD ioctl,
3. configure the mux using GSMIOC_GETCONF / GSMIOC_SETCONF ioctl,
Major parts of the initialization program :
(a good starting point is util-linux-ng/sys-utils/ldattach.c)::
#include <linux/gsmmux.h>
#define N_GSM0710 21 /* GSM 0710 Mux */
#define DEFAULT_SPEED B115200
#define SERIAL_PORT /dev/ttyS0
int ldisc = N_GSM0710;
struct gsm_config c;
struct termios configuration;
/* open the serial port connected to the modem */
fd = open(SERIAL_PORT, O_RDWR | O_NOCTTY | O_NDELAY);
/* configure the serial port : speed, flow control ... */
/* send the AT commands to switch the modem to CMUX mode
and check that it's successful (should return OK) */
write(fd, "AT+CMUX=0\r", 10);
/* experience showed that some modems need some time before
being able to answer to the first MUX packet so a delay
may be needed here in some case */
sleep(3);
/* use n_gsm line discipline */
ioctl(fd, TIOCSETD, &ldisc);
/* get n_gsm configuration */
ioctl(fd, GSMIOC_GETCONF, &c);
/* we are initiator and need encoding 0 (basic) */
c.initiator = 1;
c.encapsulation = 0;
/* our modem defaults to a maximum size of 127 bytes */
c.mru = 127;
c.mtu = 127;
/* set the new configuration */
ioctl(fd, GSMIOC_SETCONF, &c);
/* and wait for ever to keep the line discipline enabled */
daemon(0,0);
pause();
4. create the devices corresponding to the "virtual" serial ports (take care,
each modem has its configuration and some DLC have dedicated functions,
for example GPS), starting with minor 1 (DLC0 is reserved for the management
of the mux)::
MAJOR=`cat /proc/devices |grep gsmtty | awk '{print $1}`
for i in `seq 1 4`; do
mknod /dev/ttygsm$i c $MAJOR $i
done
5. use these devices as plain serial ports.
for example, it's possible:
- and to use gnokii to send / receive SMS on ttygsm1
- to use ppp to establish a datalink on ttygsm2
6. first close all virtual ports before closing the physical port.
Note that after closing the physical port the modem is still in multiplexing
mode. This may prevent a successful re-opening of the port later. To avoid
this situation either reset the modem if your hardware allows that or send
a disconnect command frame manually before initializing the multiplexing mode
for the second time. The byte sequence for the disconnect command frame is::
0xf9, 0x03, 0xef, 0x03, 0xc3, 0x16, 0xf9.
Additional Documentation
------------------------
More practical details on the protocol and how it's supported by industrial
modems can be found in the following documents :
- http://www.telit.com/module/infopool/download.php?id=616
- http://www.u-blox.com/images/downloads/Product_Docs/LEON-G100-G200-MuxImplementation_ApplicationNote_%28GSM%20G1-CS-10002%29.pdf
- http://www.sierrawireless.com/Support/Downloads/AirPrime/WMP_Series/~/media/Support_Downloads/AirPrime/Application_notes/CMUX_Feature_Application_Note-Rev004.ashx
- http://wm.sim.com/sim/News/photo/2010721161442.pdf
11-03-08 - Eric Bénard - <eric@eukrea.com>

View File

@@ -0,0 +1,185 @@
================================================
Comtrol(tm) RocketPort(R)/RocketModem(TM) Series
================================================
Device Driver for the Linux Operating System
============================================
Product overview
----------------
This driver provides a loadable kernel driver for the Comtrol RocketPort
and RocketModem PCI boards. These boards provide, 2, 4, 8, 16, or 32
high-speed serial ports or modems. This driver supports up to a combination
of four RocketPort or RocketModems boards in one machine simultaneously.
This file assumes that you are using the RocketPort driver which is
integrated into the kernel sources.
The driver can also be installed as an external module using the usual
"make;make install" routine. This external module driver, obtainable
from the Comtrol website listed below, is useful for updating the driver
or installing it into kernels which do not have the driver configured
into them. Installations instructions for the external module
are in the included README and HW_INSTALL files.
RocketPort ISA and RocketModem II PCI boards currently are only supported by
this driver in module form.
The RocketPort ISA board requires I/O ports to be configured by the DIP
switches on the board. See the section "ISA Rocketport Boards" below for
information on how to set the DIP switches.
You pass the I/O port to the driver using the following module parameters:
board1:
I/O port for the first ISA board
board2:
I/O port for the second ISA board
board3:
I/O port for the third ISA board
board4:
I/O port for the fourth ISA board
There is a set of utilities and scripts provided with the external driver
(downloadable from http://www.comtrol.com) that ease the configuration and
setup of the ISA cards.
The RocketModem II PCI boards require firmware to be loaded into the card
before it will function. The driver has only been tested as a module for this
board.
Installation Procedures
-----------------------
RocketPort/RocketModem PCI cards require no driver configuration, they are
automatically detected and configured.
The RocketPort driver can be installed as a module (recommended) or built
into the kernel. This is selected, as for other drivers, through the `make config`
command from the root of the Linux source tree during the kernel build process.
The RocketPort/RocketModem serial ports installed by this driver are assigned
device major number 46, and will be named /dev/ttyRx, where x is the port number
starting at zero (ex. /dev/ttyR0, /devttyR1, ...). If you have multiple cards
installed in the system, the mapping of port names to serial ports is displayed
in the system log at /var/log/messages.
If installed as a module, the module must be loaded. This can be done
manually by entering "modprobe rocket". To have the module loaded automatically
upon system boot, edit a `/etc/modprobe.d/*.conf` file and add the line
"alias char-major-46 rocket".
In order to use the ports, their device names (nodes) must be created with mknod.
This is only required once, the system will retain the names once created. To
create the RocketPort/RocketModem device names, use the command
"mknod /dev/ttyRx c 46 x" where x is the port number starting at zero.
For example::
> mknod /dev/ttyR0 c 46 0
> mknod /dev/ttyR1 c 46 1
> mknod /dev/ttyR2 c 46 2
The Linux script MAKEDEV will create the first 16 ttyRx device names (nodes)
for you::
>/dev/MAKEDEV ttyR
ISA Rocketport Boards
---------------------
You must assign and configure the I/O addresses used by the ISA Rocketport
card before installing and using it. This is done by setting a set of DIP
switches on the Rocketport board.
Setting the I/O address
-----------------------
Before installing RocketPort(R) or RocketPort RA boards, you must find
a range of I/O addresses for it to use. The first RocketPort card
requires a 68-byte contiguous block of I/O addresses, starting at one
of the following: 0x100h, 0x140h, 0x180h, 0x200h, 0x240h, 0x280h,
0x300h, 0x340h, 0x380h. This I/O address must be reflected in the DIP
switches of *all* of the Rocketport cards.
The second, third, and fourth RocketPort cards require a 64-byte
contiguous block of I/O addresses, starting at one of the following
I/O addresses: 0x100h, 0x140h, 0x180h, 0x1C0h, 0x200h, 0x240h, 0x280h,
0x2C0h, 0x300h, 0x340h, 0x380h, 0x3C0h. The I/O address used by the
second, third, and fourth Rocketport cards (if present) are set via
software control. The DIP switch settings for the I/O address must be
set to the value of the first Rocketport cards.
In order to distinguish each of the card from the others, each card
must have a unique board ID set on the dip switches. The first
Rocketport board must be set with the DIP switches corresponding to
the first board, the second board must be set with the DIP switches
corresponding to the second board, etc. IMPORTANT: The board ID is
the only place where the DIP switch settings should differ between the
various Rocketport boards in a system.
The I/O address range used by any of the RocketPort cards must not
conflict with any other cards in the system, including other
RocketPort cards. Below, you will find a list of commonly used I/O
address ranges which may be in use by other devices in your system.
On a Linux system, "cat /proc/ioports" will also be helpful in
identifying what I/O addresses are being used by devices on your
system.
Remember, the FIRST RocketPort uses 68 I/O addresses. So, if you set it
for 0x100, it will occupy 0x100 to 0x143. This would mean that you
CAN NOT set the second, third or fourth board for address 0x140 since
the first 4 bytes of that range are used by the first board. You would
need to set the second, third, or fourth board to one of the next available
blocks such as 0x180.
RocketPort and RocketPort RA SW1 Settings::
+-------------------------------+
| 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
+-------+-------+---------------+
| Unused| Card | I/O Port Block|
+-------------------------------+
DIP Switches DIP Switches
7 8 6 5
=================== ===================
On On UNUSED, MUST BE ON. On On First Card <==== Default
On Off Second Card
Off On Third Card
Off Off Fourth Card
DIP Switches I/O Address Range
4 3 2 1 Used by the First Card
=====================================
On Off On Off 100-143
On Off Off On 140-183
On Off Off Off 180-1C3 <==== Default
Off On On Off 200-243
Off On Off On 240-283
Off On Off Off 280-2C3
Off Off On Off 300-343
Off Off Off On 340-383
Off Off Off Off 380-3C3
Reporting Bugs
--------------
For technical support, please provide the following
information: Driver version, kernel release, distribution of
kernel, and type of board you are using. Error messages and log
printouts port configuration details are especially helpful.
USA:
:Phone: (612) 494-4100
:FAX: (612) 494-4199
:email: support@comtrol.com
Comtrol Europe:
:Phone: +44 (0) 1 869 323-220
:FAX: +44 (0) 1 869 323-211
:email: support@comtrol.co.uk
Web: http://www.comtrol.com
FTP: ftp.comtrol.com

View File

@@ -0,0 +1,90 @@
=============================
ISO7816 Serial Communications
=============================
1. Introduction
===============
ISO/IEC7816 is a series of standards specifying integrated circuit cards (ICC)
also known as smart cards.
2. Hardware-related considerations
==================================
Some CPUs/UARTs (e.g., Microchip AT91) contain a built-in mode capable of
handling communication with a smart card.
For these microcontrollers, the Linux driver should be made capable of
working in both modes, and proper ioctls (see later) should be made
available at user-level to allow switching from one mode to the other, and
vice versa.
3. Data Structures Already Available in the Kernel
==================================================
The Linux kernel provides the serial_iso7816 structure (see [1]) to handle
ISO7816 communications. This data structure is used to set and configure
ISO7816 parameters in ioctls.
Any driver for devices capable of working both as RS232 and ISO7816 should
implement the iso7816_config callback in the uart_port structure. The
serial_core calls iso7816_config to do the device specific part in response
to TIOCGISO7816 and TIOCSISO7816 ioctls (see below). The iso7816_config
callback receives a pointer to struct serial_iso7816.
4. Usage from user-level
========================
From user-level, ISO7816 configuration can be get/set using the previous
ioctls. For instance, to set ISO7816 you can use the following code::
#include <linux/serial.h>
/* Include definition for ISO7816 ioctls: TIOCSISO7816 and TIOCGISO7816 */
#include <sys/ioctl.h>
/* Open your specific device (e.g., /dev/mydevice): */
int fd = open ("/dev/mydevice", O_RDWR);
if (fd < 0) {
/* Error handling. See errno. */
}
struct serial_iso7816 iso7816conf;
/* Reserved fields as to be zeroed */
memset(&iso7816conf, 0, sizeof(iso7816conf));
/* Enable ISO7816 mode: */
iso7816conf.flags |= SER_ISO7816_ENABLED;
/* Select the protocol: */
/* T=0 */
iso7816conf.flags |= SER_ISO7816_T(0);
/* or T=1 */
iso7816conf.flags |= SER_ISO7816_T(1);
/* Set the guard time: */
iso7816conf.tg = 2;
/* Set the clock frequency*/
iso7816conf.clk = 3571200;
/* Set transmission factors: */
iso7816conf.sc_fi = 372;
iso7816conf.sc_di = 1;
if (ioctl(fd_usart, TIOCSISO7816, &iso7816conf) < 0) {
/* Error handling. See errno. */
}
/* Use read() and write() syscalls here... */
/* Close the device when finished: */
if (close (fd) < 0) {
/* Error handling. See errno. */
}
5. References
=============
[1] include/uapi/linux/serial.h

View File

@@ -0,0 +1,103 @@
===========================
RS485 Serial Communications
===========================
1. Introduction
===============
EIA-485, also known as TIA/EIA-485 or RS-485, is a standard defining the
electrical characteristics of drivers and receivers for use in balanced
digital multipoint systems.
This standard is widely used for communications in industrial automation
because it can be used effectively over long distances and in electrically
noisy environments.
2. Hardware-related Considerations
==================================
Some CPUs/UARTs (e.g., Atmel AT91 or 16C950 UART) contain a built-in
half-duplex mode capable of automatically controlling line direction by
toggling RTS or DTR signals. That can be used to control external
half-duplex hardware like an RS485 transceiver or any RS232-connected
half-duplex devices like some modems.
For these microcontrollers, the Linux driver should be made capable of
working in both modes, and proper ioctls (see later) should be made
available at user-level to allow switching from one mode to the other, and
vice versa.
3. Data Structures Already Available in the Kernel
==================================================
The Linux kernel provides the serial_rs485 structure (see [1]) to handle
RS485 communications. This data structure is used to set and configure RS485
parameters in the platform data and in ioctls.
The device tree can also provide RS485 boot time parameters (see [2]
for bindings). The driver is in charge of filling this data structure from
the values given by the device tree.
Any driver for devices capable of working both as RS232 and RS485 should
implement the rs485_config callback in the uart_port structure. The
serial_core calls rs485_config to do the device specific part in response
to TIOCSRS485 and TIOCGRS485 ioctls (see below). The rs485_config callback
receives a pointer to struct serial_rs485.
4. Usage from user-level
========================
From user-level, RS485 configuration can be get/set using the previous
ioctls. For instance, to set RS485 you can use the following code::
#include <linux/serial.h>
/* Include definition for RS485 ioctls: TIOCGRS485 and TIOCSRS485 */
#include <sys/ioctl.h>
/* Open your specific device (e.g., /dev/mydevice): */
int fd = open ("/dev/mydevice", O_RDWR);
if (fd < 0) {
/* Error handling. See errno. */
}
struct serial_rs485 rs485conf;
/* Enable RS485 mode: */
rs485conf.flags |= SER_RS485_ENABLED;
/* Set logical level for RTS pin equal to 1 when sending: */
rs485conf.flags |= SER_RS485_RTS_ON_SEND;
/* or, set logical level for RTS pin equal to 0 when sending: */
rs485conf.flags &= ~(SER_RS485_RTS_ON_SEND);
/* Set logical level for RTS pin equal to 1 after sending: */
rs485conf.flags |= SER_RS485_RTS_AFTER_SEND;
/* or, set logical level for RTS pin equal to 0 after sending: */
rs485conf.flags &= ~(SER_RS485_RTS_AFTER_SEND);
/* Set rts delay before send, if needed: */
rs485conf.delay_rts_before_send = ...;
/* Set rts delay after send, if needed: */
rs485conf.delay_rts_after_send = ...;
/* Set this flag if you want to receive data even while sending data */
rs485conf.flags |= SER_RS485_RX_DURING_TX;
if (ioctl (fd, TIOCSRS485, &rs485conf) < 0) {
/* Error handling. See errno. */
}
/* Use read() and write() syscalls here... */
/* Close the device when finished: */
if (close (fd) < 0) {
/* Error handling. See errno. */
}
5. References
=============
[1] include/uapi/linux/serial.h
[2] Documentation/devicetree/bindings/serial/rs485.txt

View File

@@ -0,0 +1,328 @@
=================
The Lockronomicon
=================
Your guide to the ancient and twisted locking policies of the tty layer and
the warped logic behind them. Beware all ye who read on.
Line Discipline
---------------
Line disciplines are registered with tty_register_ldisc() passing the
discipline number and the ldisc structure. At the point of registration the
discipline must be ready to use and it is possible it will get used before
the call returns success. If the call returns an error then it won't get
called. Do not re-use ldisc numbers as they are part of the userspace ABI
and writing over an existing ldisc will cause demons to eat your computer.
After the return the ldisc data has been copied so you may free your own
copy of the structure. You must not re-register over the top of the line
discipline even with the same data or your computer again will be eaten by
demons.
In order to remove a line discipline call tty_unregister_ldisc().
In ancient times this always worked. In modern times the function will
return -EBUSY if the ldisc is currently in use. Since the ldisc referencing
code manages the module counts this should not usually be a concern.
Heed this warning: the reference count field of the registered copies of the
tty_ldisc structure in the ldisc table counts the number of lines using this
discipline. The reference count of the tty_ldisc structure within a tty
counts the number of active users of the ldisc at this instant. In effect it
counts the number of threads of execution within an ldisc method (plus those
about to enter and exit although this detail matters not).
Line Discipline Methods
-----------------------
TTY side interfaces
^^^^^^^^^^^^^^^^^^^
======================= =======================================================
open() Called when the line discipline is attached to
the terminal. No other call into the line
discipline for this tty will occur until it
completes successfully. Should initialize any
state needed by the ldisc, and set receive_room
in the tty_struct to the maximum amount of data
the line discipline is willing to accept from the
driver with a single call to receive_buf().
Returning an error will prevent the ldisc from
being attached. Can sleep.
close() This is called on a terminal when the line
discipline is being unplugged. At the point of
execution no further users will enter the
ldisc code for this tty. Can sleep.
hangup() Called when the tty line is hung up.
The line discipline should cease I/O to the tty.
No further calls into the ldisc code will occur.
The return value is ignored. Can sleep.
read() (optional) A process requests reading data from
the line. Multiple read calls may occur in parallel
and the ldisc must deal with serialization issues.
If not defined, the process will receive an EIO
error. May sleep.
write() (optional) A process requests writing data to the
line. Multiple write calls are serialized by the
tty layer for the ldisc. If not defined, the
process will receive an EIO error. May sleep.
flush_buffer() (optional) May be called at any point between
open and close, and instructs the line discipline
to empty its input buffer.
set_termios() (optional) Called on termios structure changes.
The caller passes the old termios data and the
current data is in the tty. Called under the
termios semaphore so allowed to sleep. Serialized
against itself only.
poll() (optional) Check the status for the poll/select
calls. Multiple poll calls may occur in parallel.
May sleep.
ioctl() (optional) Called when an ioctl is handed to the
tty layer that might be for the ldisc. Multiple
ioctl calls may occur in parallel. May sleep.
compat_ioctl() (optional) Called when a 32 bit ioctl is handed
to the tty layer that might be for the ldisc.
Multiple ioctl calls may occur in parallel.
May sleep.
======================= =======================================================
Driver Side Interfaces
^^^^^^^^^^^^^^^^^^^^^^
======================= =======================================================
receive_buf() (optional) Called by the low-level driver to hand
a buffer of received bytes to the ldisc for
processing. The number of bytes is guaranteed not
to exceed the current value of tty->receive_room.
All bytes must be processed.
receive_buf2() (optional) Called by the low-level driver to hand
a buffer of received bytes to the ldisc for
processing. Returns the number of bytes processed.
If both receive_buf() and receive_buf2() are
defined, receive_buf2() should be preferred.
write_wakeup() May be called at any point between open and close.
The TTY_DO_WRITE_WAKEUP flag indicates if a call
is needed but always races versus calls. Thus the
ldisc must be careful about setting order and to
handle unexpected calls. Must not sleep.
The driver is forbidden from calling this directly
from the ->write call from the ldisc as the ldisc
is permitted to call the driver write method from
this function. In such a situation defer it.
dcd_change() Report to the tty line the current DCD pin status
changes and the relative timestamp. The timestamp
cannot be NULL.
======================= =======================================================
Driver Access
^^^^^^^^^^^^^
Line discipline methods can call the following methods of the underlying
hardware driver through the function pointers within the tty->driver
structure:
======================= =======================================================
write() Write a block of characters to the tty device.
Returns the number of characters accepted. The
character buffer passed to this method is already
in kernel space.
put_char() Queues a character for writing to the tty device.
If there is no room in the queue, the character is
ignored.
flush_chars() (Optional) If defined, must be called after
queueing characters with put_char() in order to
start transmission.
write_room() Returns the numbers of characters the tty driver
will accept for queueing to be written.
ioctl() Invoke device specific ioctl.
Expects data pointers to refer to userspace.
Returns ENOIOCTLCMD for unrecognized ioctl numbers.
set_termios() Notify the tty driver that the device's termios
settings have changed. New settings are in
tty->termios. Previous settings should be passed in
the "old" argument.
The API is defined such that the driver should return
the actual modes selected. This means that the
driver function is responsible for modifying any
bits in the request it cannot fulfill to indicate
the actual modes being used. A device with no
hardware capability for change (e.g. a USB dongle or
virtual port) can provide NULL for this method.
throttle() Notify the tty driver that input buffers for the
line discipline are close to full, and it should
somehow signal that no more characters should be
sent to the tty.
unthrottle() Notify the tty driver that characters can now be
sent to the tty without fear of overrunning the
input buffers of the line disciplines.
stop() Ask the tty driver to stop outputting characters
to the tty device.
start() Ask the tty driver to resume sending characters
to the tty device.
hangup() Ask the tty driver to hang up the tty device.
break_ctl() (Optional) Ask the tty driver to turn on or off
BREAK status on the RS-232 port. If state is -1,
then the BREAK status should be turned on; if
state is 0, then BREAK should be turned off.
If this routine is not implemented, use ioctls
TIOCSBRK / TIOCCBRK instead.
wait_until_sent() Waits until the device has written out all of the
characters in its transmitter FIFO.
send_xchar() Send a high-priority XON/XOFF character to the device.
======================= =======================================================
Flags
^^^^^
Line discipline methods have access to tty->flags field containing the
following interesting flags:
======================= =======================================================
TTY_THROTTLED Driver input is throttled. The ldisc should call
tty->driver->unthrottle() in order to resume
reception when it is ready to process more data.
TTY_DO_WRITE_WAKEUP If set, causes the driver to call the ldisc's
write_wakeup() method in order to resume
transmission when it can accept more data
to transmit.
TTY_IO_ERROR If set, causes all subsequent userspace read/write
calls on the tty to fail, returning -EIO.
TTY_OTHER_CLOSED Device is a pty and the other side has closed.
TTY_NO_WRITE_SPLIT Prevent driver from splitting up writes into
smaller chunks.
======================= =======================================================
Locking
^^^^^^^
Callers to the line discipline functions from the tty layer are required to
take line discipline locks. The same is true of calls from the driver side
but not yet enforced.
Three calls are now provided::
ldisc = tty_ldisc_ref(tty);
takes a handle to the line discipline in the tty and returns it. If no ldisc
is currently attached or the ldisc is being closed and re-opened at this
point then NULL is returned. While this handle is held the ldisc will not
change or go away::
tty_ldisc_deref(ldisc)
Returns the ldisc reference and allows the ldisc to be closed. Returning the
reference takes away your right to call the ldisc functions until you take
a new reference::
ldisc = tty_ldisc_ref_wait(tty);
Performs the same function as tty_ldisc_ref except that it will wait for an
ldisc change to complete and then return a reference to the new ldisc.
While these functions are slightly slower than the old code they should have
minimal impact as most receive logic uses the flip buffers and they only
need to take a reference when they push bits up through the driver.
A caution: The ldisc->open(), ldisc->close() and driver->set_ldisc
functions are called with the ldisc unavailable. Thus tty_ldisc_ref will
fail in this situation if used within these functions. Ldisc and driver
code calling its own functions must be careful in this case.
Driver Interface
----------------
======================= =======================================================
open() Called when a device is opened. May sleep
close() Called when a device is closed. At the point of
return from this call the driver must make no
further ldisc calls of any kind. May sleep
write() Called to write bytes to the device. May not
sleep. May occur in parallel in special cases.
Because this includes panic paths drivers generally
shouldn't try and do clever locking here.
put_char() Stuff a single character onto the queue. The
driver is guaranteed following up calls to
flush_chars.
flush_chars() Ask the kernel to write put_char queue
write_room() Return the number of characters that can be stuffed
into the port buffers without overflow (or less).
The ldisc is responsible for being intelligent
about multi-threading of write_room/write calls
ioctl() Called when an ioctl may be for the driver
set_termios() Called on termios change, serialized against
itself by a semaphore. May sleep.
set_ldisc() Notifier for discipline change. At the point this
is done the discipline is not yet usable. Can now
sleep (I think)
throttle() Called by the ldisc to ask the driver to do flow
control. Serialization including with unthrottle
is the job of the ldisc layer.
unthrottle() Called by the ldisc to ask the driver to stop flow
control.
stop() Ldisc notifier to the driver to stop output. As with
throttle the serializations with start() are down
to the ldisc layer.
start() Ldisc notifier to the driver to start output.
hangup() Ask the tty driver to cause a hangup initiated
from the host side. [Can sleep ??]
break_ctl() Send RS232 break. Can sleep. Can get called in
parallel, driver must serialize (for now), and
with write calls.
wait_until_sent() Wait for characters to exit the hardware queue
of the driver. Can sleep
send_xchar() Send XON/XOFF and if possible jump the queue with
it in order to get fast flow control responses.
Cannot sleep ??
======================= =======================================================

View File

@@ -0,0 +1,49 @@
====================================
SGI IOC4 PCI (multi function) device
====================================
The SGI IOC4 PCI device is a bit of a strange beast, so some notes on
it are in order.
First, even though the IOC4 performs multiple functions, such as an
IDE controller, a serial controller, a PS/2 keyboard/mouse controller,
and an external interrupt mechanism, it's not implemented as a
multifunction device. The consequence of this from a software
standpoint is that all these functions share a single IRQ, and
they can't all register to own the same PCI device ID. To make
matters a bit worse, some of the register blocks (and even registers
themselves) present in IOC4 are mixed-purpose between these several
functions, meaning that there's no clear "owning" device driver.
The solution is to organize the IOC4 driver into several independent
drivers, "ioc4", "sgiioc4", and "ioc4_serial". Note that there is no
PS/2 controller driver as this functionality has never been wired up
on a shipping IO card.
ioc4
====
This is the core (or shim) driver for IOC4. It is responsible for
initializing the basic functionality of the chip, and allocating
the PCI resources that are shared between the IOC4 functions.
This driver also provides registration functions that the other
IOC4 drivers can call to make their presence known. Each driver
needs to provide a probe and remove function, which are invoked
by the core driver at appropriate times. The interface of these
IOC4 function probe and remove operations isn't precisely the same
as PCI device probe and remove operations, but is logically the
same operation.
sgiioc4
=======
This is the IDE driver for IOC4. Its name isn't very descriptive
simply for historical reasons (it used to be the only IOC4 driver
component). There's not much to say about it other than it hooks
up to the ioc4 driver via the appropriate registration, probe, and
remove functions.
ioc4_serial
===========
This is the serial driver for IOC4. There's not much to say about it
other than it hooks up to the ioc4 driver via the appropriate registration,
probe, and remove functions.

View File

@@ -0,0 +1,74 @@
.. include:: <isonum.txt>
============
SM501 Driver
============
:Copyright: |copy| 2006, 2007 Simtec Electronics
The Silicon Motion SM501 multimedia companion chip is a multifunction device
which may provide numerous interfaces including USB host controller USB gadget,
asynchronous serial ports, audio functions, and a dual display video interface.
The device may be connected by PCI or local bus with varying functions enabled.
Core
----
The core driver in drivers/mfd provides common services for the
drivers which manage the specific hardware blocks. These services
include locking for common registers, clock control and resource
management.
The core registers drivers for both PCI and generic bus based
chips via the platform device and driver system.
On detection of a device, the core initialises the chip (which may
be specified by the platform data) and then exports the selected
peripheral set as platform devices for the specific drivers.
The core re-uses the platform device system as the platform device
system provides enough features to support the drivers without the
need to create a new bus-type and the associated code to go with it.
Resources
---------
Each peripheral has a view of the device which is implicitly narrowed to
the specific set of resources that peripheral requires in order to
function correctly.
The centralised memory allocation allows the driver to ensure that the
maximum possible resource allocation can be made to the video subsystem
as this is by-far the most resource-sensitive of the on-chip functions.
The primary issue with memory allocation is that of moving the video
buffers once a display mode is chosen. Indeed when a video mode change
occurs the memory footprint of the video subsystem changes.
Since video memory is difficult to move without changing the display
(unless sufficient contiguous memory can be provided for the old and new
modes simultaneously) the video driver fully utilises the memory area
given to it by aligning fb0 to the start of the area and fb1 to the end
of it. Any memory left over in the middle is used for the acceleration
functions, which are transient and thus their location is less critical
as it can be moved.
Configuration
-------------
The platform device driver uses a set of platform data to pass
configurations through to the core and the subsidiary drivers
so that there can be support for more than one system carrying
an SM501 built into a single kernel image.
The PCI driver assumes that the PCI card behaves as per the Silicon
Motion reference design.
There is an errata (AB-5) affecting the selection of the
of the M1XCLK and M1CLK frequencies. These two clocks
must be sourced from the same PLL, although they can then
be divided down individually. If this is not set, then SM501 may
lock and hang the whole system. The driver will refuse to
attach if the PLL selection is different.

View File

@@ -0,0 +1,60 @@
=================================================
Msc Keyboard Scan Expansion/GPIO Expansion device
=================================================
What is smsc-ece1099?
----------------------
The ECE1099 is a 40-Pin 3.3V Keyboard Scan Expansion
or GPIO Expansion device. The device supports a keyboard
scan matrix of 23x8. The device is connected to a Master
via the SMSC BC-Link interface or via the SMBus.
Keypad scan Input(KSI) and Keypad Scan Output(KSO) signals
are multiplexed with GPIOs.
Interrupt generation
--------------------
Interrupts can be generated by an edge detection on a GPIO
pin or an edge detection on one of the bus interface pins.
Interrupts can also be detected on the keyboard scan interface.
The bus interrupt pin (BC_INT# or SMBUS_INT#) is asserted if
any bit in one of the Interrupt Status registers is 1 and
the corresponding Interrupt Mask bit is also 1.
In order for software to determine which device is the source
of an interrupt, it should first read the Group Interrupt Status Register
to determine which Status register group is a source for the interrupt.
Software should read both the Status register and the associated Mask register,
then AND the two values together. Bits that are 1 in the result of the AND
are active interrupts. Software clears an interrupt by writing a 1 to the
corresponding bit in the Status register.
Communication Protocol
----------------------
- SMbus slave Interface
The host processor communicates with the ECE1099 device
through a series of read/write registers via the SMBus
interface. SMBus is a serial communication protocol between
a computer host and its peripheral devices. The SMBus data
rate is 10KHz minimum to 400 KHz maximum
- Slave Bus Interface
The ECE1099 device SMBus implementation is a subset of the
SMBus interface to the host. The device is a slave-only SMBus device.
The implementation in the device is a subset of SMBus since it
only supports four protocols.
The Write Byte, Read Byte, Send Byte, and Receive Byte protocols are the
only valid SMBus protocols for the device.
- BC-LinkTM Interface
The BC-Link is a proprietary bus that allows communication
between a Master device and a Companion device. The Master
device uses this serial bus to read and write registers
located on the Companion device. The bus comprises three signals,
BC_CLK, BC_DAT and BC_INT#. The Master device always provides the
clock, BC_CLK, and the Companion device is the source for an
independent asynchronous interrupt signal, BC_INT#. The ECE1099
supports BC-Link speeds up to 24MHz.

View File

@@ -0,0 +1,102 @@
========================
Linux Switchtec Support
========================
Microsemi's "Switchtec" line of PCI switch devices is already
supported by the kernel with standard PCI switch drivers. However, the
Switchtec device advertises a special management endpoint which
enables some additional functionality. This includes:
* Packet and Byte Counters
* Firmware Upgrades
* Event and Error logs
* Querying port link status
* Custom user firmware commands
The switchtec kernel module implements this functionality.
Interface
=========
The primary means of communicating with the Switchtec management firmware is
through the Memory-mapped Remote Procedure Call (MRPC) interface.
Commands are submitted to the interface with a 4-byte command
identifier and up to 1KB of command specific data. The firmware will
respond with a 4-byte return code and up to 1KB of command-specific
data. The interface only processes a single command at a time.
Userspace Interface
===================
The MRPC interface will be exposed to userspace through a simple char
device: /dev/switchtec#, one for each management endpoint in the system.
The char device has the following semantics:
* A write must consist of at least 4 bytes and no more than 1028 bytes.
The first 4 bytes will be interpreted as the Command ID and the
remainder will be used as the input data. A write will send the
command to the firmware to begin processing.
* Each write must be followed by exactly one read. Any double write will
produce an error and any read that doesn't follow a write will
produce an error.
* A read will block until the firmware completes the command and return
the 4-byte Command Return Value plus up to 1024 bytes of output
data. (The length will be specified by the size parameter of the read
call -- reading less than 4 bytes will produce an error.)
* The poll call will also be supported for userspace applications that
need to do other things while waiting for the command to complete.
The following IOCTLs are also supported by the device:
* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number
of partitions in the device.
* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and lengeth for
any specified partition in flash.
* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps
indicating all uncleared events.
* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags
for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct
with the event_id, index and flags set (index being the partition or PFF
number for non-global events). It returns whether the event has
occurred, the number of times and any event specific data. The flags
can be used to clear the count or enable and disable actions to
happen when the event occurs.
By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag,
you can set an event to trigger a poll command to return with
POLLPRI. In this way, userspace can wait for events to occur.
* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert
between PCI Function Framework number (used by the event system)
and Switchtec Logic Port ID and Partition number (which is more
user friendly).
Non-Transparent Bridge (NTB) Driver
===================================
An NTB hardware driver is provided for the Switchtec hardware in
ntb_hw_switchtec. Currently, it only supports switches configured with
exactly 2 NT partitions and zero or more non-NT partitions. It also requires
the following configuration settings:
* Both NT partitions must be able to access each other's GAS spaces.
Thus, the bits in the GAS Access Vector under Management Settings
must be set to support this.
* Kernel configuration MUST include support for NTB (CONFIG_NTB needs
to be set)
NT EP BAR 2 will be dynamically configured as a Direct Window, and
the configuration file does not need to configure it explicitly.
Please refer to Documentation/driver-api/ntb.rst in Linux source tree for an overall
understanding of the Linux NTB stack. ntb_hw_switchtec works as an NTB
Hardware Driver in this stack.

View File

@@ -0,0 +1,86 @@
===================
Sync File API Guide
===================
:Author: Gustavo Padovan <gustavo at padovan dot org>
This document serves as a guide for device drivers writers on what the
sync_file API is, and how drivers can support it. Sync file is the carrier of
the fences(struct dma_fence) that are needed to synchronize between drivers or
across process boundaries.
The sync_file API is meant to be used to send and receive fence information
to/from userspace. It enables userspace to do explicit fencing, where instead
of attaching a fence to the buffer a producer driver (such as a GPU or V4L
driver) sends the fence related to the buffer to userspace via a sync_file.
The sync_file then can be sent to the consumer (DRM driver for example), that
will not use the buffer for anything before the fence(s) signals, i.e., the
driver that issued the fence is not using/processing the buffer anymore, so it
signals that the buffer is ready to use. And vice-versa for the consumer ->
producer part of the cycle.
Sync files allows userspace awareness on buffer sharing synchronization between
drivers.
Sync file was originally added in the Android kernel but current Linux Desktop
can benefit a lot from it.
in-fences and out-fences
------------------------
Sync files can go either to or from userspace. When a sync_file is sent from
the driver to userspace we call the fences it contains 'out-fences'. They are
related to a buffer that the driver is processing or is going to process, so
the driver creates an out-fence to be able to notify, through
dma_fence_signal(), when it has finished using (or processing) that buffer.
Out-fences are fences that the driver creates.
On the other hand if the driver receives fence(s) through a sync_file from
userspace we call these fence(s) 'in-fences'. Receiving in-fences means that
we need to wait for the fence(s) to signal before using any buffer related to
the in-fences.
Creating Sync Files
-------------------
When a driver needs to send an out-fence userspace it creates a sync_file.
Interface::
struct sync_file *sync_file_create(struct dma_fence *fence);
The caller pass the out-fence and gets back the sync_file. That is just the
first step, next it needs to install an fd on sync_file->file. So it gets an
fd::
fd = get_unused_fd_flags(O_CLOEXEC);
and installs it on sync_file->file::
fd_install(fd, sync_file->file);
The sync_file fd now can be sent to userspace.
If the creation process fail, or the sync_file needs to be released by any
other reason fput(sync_file->file) should be used.
Receiving Sync Files from Userspace
-----------------------------------
When userspace needs to send an in-fence to the driver it passes file descriptor
of the Sync File to the kernel. The kernel can then retrieve the fences
from it.
Interface::
struct dma_fence *sync_file_get_fence(int fd);
The returned reference is owned by the caller and must be disposed of
afterwards using dma_fence_put(). In case of error, a NULL is returned instead.
References:
1. struct sync_file in include/linux/sync_file.h
2. All interfaces mentioned above defined in include/linux/sync_file.h

View File

@@ -0,0 +1,414 @@
.. include:: <isonum.txt>
=====================
VFIO Mediated devices
=====================
:Copyright: |copy| 2016, NVIDIA CORPORATION. All rights reserved.
:Author: Neo Jia <cjia@nvidia.com>
:Author: Kirti Wankhede <kwankhede@nvidia.com>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2 as
published by the Free Software Foundation.
Virtual Function I/O (VFIO) Mediated devices[1]
===============================================
The number of use cases for virtualizing DMA devices that do not have built-in
SR_IOV capability is increasing. Previously, to virtualize such devices,
developers had to create their own management interfaces and APIs, and then
integrate them with user space software. To simplify integration with user space
software, we have identified common requirements and a unified management
interface for such devices.
The VFIO driver framework provides unified APIs for direct device access. It is
an IOMMU/device-agnostic framework for exposing direct device access to user
space in a secure, IOMMU-protected environment. This framework is used for
multiple devices, such as GPUs, network adapters, and compute accelerators. With
direct device access, virtual machines or user space applications have direct
access to the physical device. This framework is reused for mediated devices.
The mediated core driver provides a common interface for mediated device
management that can be used by drivers of different devices. This module
provides a generic interface to perform these operations:
* Create and destroy a mediated device
* Add a mediated device to and remove it from a mediated bus driver
* Add a mediated device to and remove it from an IOMMU group
The mediated core driver also provides an interface to register a bus driver.
For example, the mediated VFIO mdev driver is designed for mediated devices and
supports VFIO APIs. The mediated bus driver adds a mediated device to and
removes it from a VFIO group.
The following high-level block diagram shows the main components and interfaces
in the VFIO mediated driver framework. The diagram shows NVIDIA, Intel, and IBM
devices as examples, as these devices are the first devices to use this module::
+---------------+
| |
| +-----------+ | mdev_register_driver() +--------------+
| | | +<------------------------+ |
| | mdev | | | |
| | bus | +------------------------>+ vfio_mdev.ko |<-> VFIO user
| | driver | | probe()/remove() | | APIs
| | | | +--------------+
| +-----------+ |
| |
| MDEV CORE |
| MODULE |
| mdev.ko |
| +-----------+ | mdev_register_device() +--------------+
| | | +<------------------------+ |
| | | | | nvidia.ko |<-> physical
| | | +------------------------>+ | device
| | | | callbacks +--------------+
| | Physical | |
| | device | | mdev_register_device() +--------------+
| | interface | |<------------------------+ |
| | | | | i915.ko |<-> physical
| | | +------------------------>+ | device
| | | | callbacks +--------------+
| | | |
| | | | mdev_register_device() +--------------+
| | | +<------------------------+ |
| | | | | ccw_device.ko|<-> physical
| | | +------------------------>+ | device
| | | | callbacks +--------------+
| +-----------+ |
+---------------+
Registration Interfaces
=======================
The mediated core driver provides the following types of registration
interfaces:
* Registration interface for a mediated bus driver
* Physical device driver interface
Registration Interface for a Mediated Bus Driver
------------------------------------------------
The registration interface for a mediated bus driver provides the following
structure to represent a mediated device's driver::
/*
* struct mdev_driver [2] - Mediated device's driver
* @name: driver name
* @probe: called when new device created
* @remove: called when device removed
* @driver: device driver structure
*/
struct mdev_driver {
const char *name;
int (*probe) (struct device *dev);
void (*remove) (struct device *dev);
struct device_driver driver;
};
A mediated bus driver for mdev should use this structure in the function calls
to register and unregister itself with the core driver:
* Register::
extern int mdev_register_driver(struct mdev_driver *drv,
struct module *owner);
* Unregister::
extern void mdev_unregister_driver(struct mdev_driver *drv);
The mediated bus driver is responsible for adding mediated devices to the VFIO
group when devices are bound to the driver and removing mediated devices from
the VFIO when devices are unbound from the driver.
Physical Device Driver Interface
--------------------------------
The physical device driver interface provides the mdev_parent_ops[3] structure
to define the APIs to manage work in the mediated core driver that is related
to the physical device.
The structures in the mdev_parent_ops structure are as follows:
* dev_attr_groups: attributes of the parent device
* mdev_attr_groups: attributes of the mediated device
* supported_config: attributes to define supported configurations
The functions in the mdev_parent_ops structure are as follows:
* create: allocate basic resources in a driver for a mediated device
* remove: free resources in a driver when a mediated device is destroyed
(Note that mdev-core provides no implicit serialization of create/remove
callbacks per mdev parent device, per mdev type, or any other categorization.
Vendor drivers are expected to be fully asynchronous in this respect or
provide their own internal resource protection.)
The callbacks in the mdev_parent_ops structure are as follows:
* open: open callback of mediated device
* close: close callback of mediated device
* ioctl: ioctl callback of mediated device
* read : read emulation callback
* write: write emulation callback
* mmap: mmap emulation callback
A driver should use the mdev_parent_ops structure in the function call to
register itself with the mdev core driver::
extern int mdev_register_device(struct device *dev,
const struct mdev_parent_ops *ops);
However, the mdev_parent_ops structure is not required in the function call
that a driver should use to unregister itself with the mdev core driver::
extern void mdev_unregister_device(struct device *dev);
Mediated Device Management Interface Through sysfs
==================================================
The management interface through sysfs enables user space software, such as
libvirt, to query and configure mediated devices in a hardware-agnostic fashion.
This management interface provides flexibility to the underlying physical
device's driver to support features such as:
* Mediated device hot plug
* Multiple mediated devices in a single virtual machine
* Multiple mediated devices from different physical devices
Links in the mdev_bus Class Directory
-------------------------------------
The /sys/class/mdev_bus/ directory contains links to devices that are registered
with the mdev core driver.
Directories and files under the sysfs for Each Physical Device
--------------------------------------------------------------
::
|- [parent physical device]
|--- Vendor-specific-attributes [optional]
|--- [mdev_supported_types]
| |--- [<type-id>]
| | |--- create
| | |--- name
| | |--- available_instances
| | |--- device_api
| | |--- description
| | |--- [devices]
| |--- [<type-id>]
| | |--- create
| | |--- name
| | |--- available_instances
| | |--- device_api
| | |--- description
| | |--- [devices]
| |--- [<type-id>]
| |--- create
| |--- name
| |--- available_instances
| |--- device_api
| |--- description
| |--- [devices]
* [mdev_supported_types]
The list of currently supported mediated device types and their details.
[<type-id>], device_api, and available_instances are mandatory attributes
that should be provided by vendor driver.
* [<type-id>]
The [<type-id>] name is created by adding the device driver string as a prefix
to the string provided by the vendor driver. This format of this name is as
follows::
sprintf(buf, "%s-%s", dev_driver_string(parent->dev), group->name);
(or using mdev_parent_dev(mdev) to arrive at the parent device outside
of the core mdev code)
* device_api
This attribute should show which device API is being created, for example,
"vfio-pci" for a PCI device.
* available_instances
This attribute should show the number of devices of type <type-id> that can be
created.
* [device]
This directory contains links to the devices of type <type-id> that have been
created.
* name
This attribute should show human readable name. This is optional attribute.
* description
This attribute should show brief features/description of the type. This is
optional attribute.
Directories and Files Under the sysfs for Each mdev Device
----------------------------------------------------------
::
|- [parent phy device]
|--- [$MDEV_UUID]
|--- remove
|--- mdev_type {link to its type}
|--- vendor-specific-attributes [optional]
* remove (write only)
Writing '1' to the 'remove' file destroys the mdev device. The vendor driver can
fail the remove() callback if that device is active and the vendor driver
doesn't support hot unplug.
Example::
# echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
Mediated device Hot plug
------------------------
Mediated devices can be created and assigned at runtime. The procedure to hot
plug a mediated device is the same as the procedure to hot plug a PCI device.
Translation APIs for Mediated Devices
=====================================
The following APIs are provided for translating user pfn to host pfn in a VFIO
driver::
extern int vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
int npage, int prot, unsigned long *phys_pfn);
extern int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn,
int npage);
These functions call back into the back-end IOMMU module by using the pin_pages
and unpin_pages callbacks of the struct vfio_iommu_driver_ops[4]. Currently
these callbacks are supported in the TYPE1 IOMMU module. To enable them for
other IOMMU backend modules, such as PPC64 sPAPR module, they need to provide
these two callback functions.
Using the Sample Code
=====================
mtty.c in samples/vfio-mdev/ directory is a sample driver program to
demonstrate how to use the mediated device framework.
The sample driver creates an mdev device that simulates a serial port over a PCI
card.
1. Build and load the mtty.ko module.
This step creates a dummy device, /sys/devices/virtual/mtty/mtty/
Files in this device directory in sysfs are similar to the following::
# tree /sys/devices/virtual/mtty/mtty/
/sys/devices/virtual/mtty/mtty/
|-- mdev_supported_types
| |-- mtty-1
| | |-- available_instances
| | |-- create
| | |-- device_api
| | |-- devices
| | `-- name
| `-- mtty-2
| |-- available_instances
| |-- create
| |-- device_api
| |-- devices
| `-- name
|-- mtty_dev
| `-- sample_mtty_dev
|-- power
| |-- autosuspend_delay_ms
| |-- control
| |-- runtime_active_time
| |-- runtime_status
| `-- runtime_suspended_time
|-- subsystem -> ../../../../class/mtty
`-- uevent
2. Create a mediated device by using the dummy device that you created in the
previous step::
# echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \
/sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create
3. Add parameters to qemu-kvm::
-device vfio-pci,\
sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
4. Boot the VM.
In the Linux guest VM, with no hardware on the host, the device appears
as follows::
# lspci -s 00:05.0 -xxvv
00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
Subsystem: Device 4348:3253
Physical Slot: 5
Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 10
Region 0: I/O ports at c150 [size=8]
Region 1: I/O ports at c158 [size=8]
Kernel driver in use: serial
00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00
In the Linux guest VM, dmesg output for the device is as follows:
serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
5. In the Linux guest VM, check the serial ports::
# setserial -g /dev/ttyS*
/dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
/dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10
/dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10
6. Using minicom or any terminal emulation program, open port /dev/ttyS1 or
/dev/ttyS2 with hardware flow control disabled.
7. Type data on the minicom terminal or send data to the terminal emulation
program and read the data.
Data is loop backed from hosts mtty driver.
8. Destroy the mediated device that you created::
# echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove
References
==========
1. See Documentation/driver-api/vfio.rst for more information on VFIO.
2. struct mdev_driver in include/linux/mdev.h
3. struct mdev_parent_ops in include/linux/mdev.h
4. struct vfio_iommu_driver_ops in include/linux/vfio.h

View File

@@ -0,0 +1,520 @@
==================================
VFIO - "Virtual Function I/O" [1]_
==================================
Many modern system now provide DMA and interrupt remapping facilities
to help ensure I/O devices behave within the boundaries they've been
allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
agnostic framework for exposing direct device access to userspace, in
a secure, IOMMU protected environment. In other words, this allows
safe [2]_, non-privileged, userspace drivers.
Why do we want that? Virtual machines often make use of direct device
access ("device assignment") when configured for the highest possible
I/O performance. From a device and host perspective, this simply
turns the VM into a userspace driver, with the benefits of
significantly reduced latency, higher bandwidth, and direct use of
bare-metal device drivers [3]_.
Some applications, particularly in the high performance computing
field, also benefit from low-overhead, direct device access from
userspace. Examples include network adapters (often non-TCP/IP based)
and compute accelerators. Prior to VFIO, these drivers had to either
go through the full development cycle to become proper upstream
driver, be maintained out of tree, or make use of the UIO framework,
which has no notion of IOMMU protection, limited interrupt support,
and requires root privileges to access things like PCI configuration
space.
The VFIO driver framework intends to unify these, replacing both the
KVM PCI specific device assignment code as well as provide a more
secure, more featureful userspace driver environment than UIO.
Groups, Devices, and IOMMUs
---------------------------
Devices are the main target of any I/O driver. Devices typically
create a programming interface made up of I/O access, interrupts,
and DMA. Without going into the details of each of these, DMA is
by far the most critical aspect for maintaining a secure environment
as allowing a device read-write access to system memory imposes the
greatest risk to the overall system integrity.
To help mitigate this risk, many modern IOMMUs now incorporate
isolation properties into what was, in many cases, an interface only
meant for translation (ie. solving the addressing problems of devices
with limited address spaces). With this, devices can now be isolated
from each other and from arbitrary memory access, thus allowing
things like secure direct assignment of devices into virtual machines.
This isolation is not always at the granularity of a single device
though. Even when an IOMMU is capable of this, properties of devices,
interconnects, and IOMMU topologies can each reduce this isolation.
For instance, an individual device may be part of a larger multi-
function enclosure. While the IOMMU may be able to distinguish
between devices within the enclosure, the enclosure may not require
transactions between devices to reach the IOMMU. Examples of this
could be anything from a multi-function PCI device with backdoors
between functions to a non-PCI-ACS (Access Control Services) capable
bridge allowing redirection without reaching the IOMMU. Topology
can also play a factor in terms of hiding devices. A PCIe-to-PCI
bridge masks the devices behind it, making transaction appear as if
from the bridge itself. Obviously IOMMU design plays a major factor
as well.
Therefore, while for the most part an IOMMU may have device level
granularity, any system is susceptible to reduced granularity. The
IOMMU API therefore supports a notion of IOMMU groups. A group is
a set of devices which is isolatable from all other devices in the
system. Groups are therefore the unit of ownership used by VFIO.
While the group is the minimum granularity that must be used to
ensure secure user access, it's not necessarily the preferred
granularity. In IOMMUs which make use of page tables, it may be
possible to share a set of page tables between different groups,
reducing the overhead both to the platform (reduced TLB thrashing,
reduced duplicate page tables), and to the user (programming only
a single set of translations). For this reason, VFIO makes use of
a container class, which may hold one or more groups. A container
is created by simply opening the /dev/vfio/vfio character device.
On its own, the container provides little functionality, with all
but a couple version and extension query interfaces locked away.
The user needs to add a group into the container for the next level
of functionality. To do this, the user first needs to identify the
group associated with the desired device. This can be done using
the sysfs links described in the example below. By unbinding the
device from the host driver and binding it to a VFIO driver, a new
VFIO group will appear for the group as /dev/vfio/$GROUP, where
$GROUP is the IOMMU group number of which the device is a member.
If the IOMMU group contains multiple devices, each will need to
be bound to a VFIO driver before operations on the VFIO group
are allowed (it's also sufficient to only unbind the device from
host drivers if a VFIO driver is unavailable; this will make the
group available, but not that particular device). TBD - interface
for disabling driver probing/locking a device.
Once the group is ready, it may be added to the container by opening
the VFIO group character device (/dev/vfio/$GROUP) and using the
VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
previously opened container file. If desired and if the IOMMU driver
supports sharing the IOMMU context between groups, multiple groups may
be set to the same container. If a group fails to set to a container
with existing groups, a new empty container will need to be used
instead.
With a group (or groups) attached to a container, the remaining
ioctls become available, enabling access to the VFIO IOMMU interfaces.
Additionally, it now becomes possible to get file descriptors for each
device within a group using an ioctl on the VFIO group file descriptor.
The VFIO device API includes ioctls for describing the device, the I/O
regions and their read/write/mmap offsets on the device descriptor, as
well as mechanisms for describing and registering interrupt
notifications.
VFIO Usage Example
------------------
Assume user wants to access PCI device 0000:06:0d.0::
$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
../../../../kernel/iommu_groups/26
This device is therefore in IOMMU group 26. This device is on the
pci bus, therefore the user will make use of vfio-pci to manage the
group::
# modprobe vfio-pci
Binding this device to the vfio-pci driver creates the VFIO group
character devices for this group::
$ lspci -n -s 0000:06:0d.0
06:0d.0 0401: 1102:0002 (rev 08)
# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
Now we need to look at what other devices are in the group to free
it for use by VFIO::
$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
total 0
lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
../../../../devices/pci0000:00/0000:00:1e.0
lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
need to add device 0000:06:0d.1 to the group following the same
procedure as above. Device 0000:00:1e.0 is a bridge that does
not currently have a host driver, therefore it's not required to
bind this device to the vfio-pci driver (vfio-pci does not currently
support PCI bridges).
The final step is to provide the user with access to the group if
unprivileged operation is desired (note that /dev/vfio/vfio provides
no capabilities on its own and is therefore expected to be set to
mode 0666 by the system)::
# chown user:user /dev/vfio/26
The user now has full access to all the devices and the iommu for this
group and can access them as follows::
int container, group, device, i;
struct vfio_group_status group_status =
{ .argsz = sizeof(group_status) };
struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
/* Create a new container */
container = open("/dev/vfio/vfio", O_RDWR);
if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
/* Unknown API version */
if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
/* Doesn't support the IOMMU driver we want. */
/* Open the group */
group = open("/dev/vfio/26", O_RDWR);
/* Test the group is viable and available */
ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
/* Group is not viable (ie, not all devices bound for vfio) */
/* Add the group to the container */
ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
/* Enable the IOMMU model we want */
ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
/* Get addition IOMMU info */
ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);
/* Allocate some space and setup a DMA mapping */
dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
dma_map.size = 1024 * 1024;
dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
/* Get a file descriptor for the device */
device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
/* Test and setup the device */
ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
for (i = 0; i < device_info.num_regions; i++) {
struct vfio_region_info reg = { .argsz = sizeof(reg) };
reg.index = i;
ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
/* Setup mappings... read/write offsets, mmaps
* For PCI devices, config space is a region */
}
for (i = 0; i < device_info.num_irqs; i++) {
struct vfio_irq_info irq = { .argsz = sizeof(irq) };
irq.index = i;
ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
}
/* Gratuitous device reset and go... */
ioctl(device, VFIO_DEVICE_RESET);
VFIO User API
-------------------------------------------------------------------------------
Please see include/linux/vfio.h for complete API documentation.
VFIO bus driver API
-------------------------------------------------------------------------------
VFIO bus drivers, such as vfio-pci make use of only a few interfaces
into VFIO core. When devices are bound and unbound to the driver,
the driver should call vfio_add_group_dev() and vfio_del_group_dev()
respectively::
extern int vfio_add_group_dev(struct device *dev,
const struct vfio_device_ops *ops,
void *device_data);
extern void *vfio_del_group_dev(struct device *dev);
vfio_add_group_dev() indicates to the core to begin tracking the
iommu_group of the specified dev and register the dev as owned by
a VFIO bus driver. The driver provides an ops structure for callbacks
similar to a file operations structure::
struct vfio_device_ops {
int (*open)(void *device_data);
void (*release)(void *device_data);
ssize_t (*read)(void *device_data, char __user *buf,
size_t count, loff_t *ppos);
ssize_t (*write)(void *device_data, const char __user *buf,
size_t size, loff_t *ppos);
long (*ioctl)(void *device_data, unsigned int cmd,
unsigned long arg);
int (*mmap)(void *device_data, struct vm_area_struct *vma);
};
Each function is passed the device_data that was originally registered
in the vfio_add_group_dev() call above. This allows the bus driver
an easy place to store its opaque, private data. The open/release
callbacks are issued when a new file descriptor is created for a
device (via VFIO_GROUP_GET_DEVICE_FD). The ioctl interface provides
a direct pass through for VFIO_DEVICE_* ioctls. The read/write/mmap
interfaces implement the device region access defined by the device's
own VFIO_DEVICE_GET_REGION_INFO ioctl.
PPC64 sPAPR implementation note
-------------------------------
This implementation has some specifics:
1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
container is supported as an IOMMU table is allocated at the boot time,
one table per a IOMMU group which is a Partitionable Endpoint (PE)
(PE is often a PCI domain but not always).
Newer systems (POWER8 with IODA2) have improved hardware design which allows
to remove this limitation and have multiple IOMMU groups per a VFIO
container.
2) The hardware supports so called DMA windows - the PCI address range
within which DMA transfer is allowed, any attempt to access address space
out of the window leads to the whole PE isolation.
3) PPC64 guests are paravirtualized but not fully emulated. There is an API
to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
currently there is no way to reduce the number of calls. In order to make
things faster, the map/unmap handling has been implemented in real mode
which provides an excellent performance which has limitations such as
inability to do locked pages accounting in real time.
4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
subtree that can be treated as a unit for the purposes of partitioning and
error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
function of a multi-function IOA, or multiple IOAs (possibly including
switch and bridge structures above the multiple IOAs). PPC64 guests detect
PCI errors and recover from them via EEH RTAS services, which works on the
basis of additional ioctl commands.
So 4 additional ioctls have been added:
VFIO_IOMMU_SPAPR_TCE_GET_INFO
returns the size and the start of the DMA window on the PCI bus.
VFIO_IOMMU_ENABLE
enables the container. The locked pages accounting
is done at this point. This lets user first to know what
the DMA window is and adjust rlimit before doing any real job.
VFIO_IOMMU_DISABLE
disables the container.
VFIO_EEH_PE_OP
provides an API for EEH setup, error detection and recovery.
The code flow from the example above should be slightly changed::
struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };
.....
/* Add the group to the container */
ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
/* Enable the IOMMU model we want */
ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU)
/* Get addition sPAPR IOMMU info */
vfio_iommu_spapr_tce_info spapr_iommu_info;
ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);
if (ioctl(container, VFIO_IOMMU_ENABLE))
/* Cannot enable container, may be low rlimit */
/* Allocate some space and setup a DMA mapping */
dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
dma_map.size = 1024 * 1024;
dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
/* Check here is .iova/.size are within DMA window from spapr_iommu_info */
ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
/* Get a file descriptor for the device */
device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
....
/* Gratuitous device reset and go... */
ioctl(device, VFIO_DEVICE_RESET);
/* Make sure EEH is supported */
ioctl(container, VFIO_CHECK_EXTENSION, VFIO_EEH);
/* Enable the EEH functionality on the device */
pe_op.op = VFIO_EEH_PE_ENABLE;
ioctl(container, VFIO_EEH_PE_OP, &pe_op);
/* You're suggested to create additional data struct to represent
* PE, and put child devices belonging to same IOMMU group to the
* PE instance for later reference.
*/
/* Check the PE's state and make sure it's in functional state */
pe_op.op = VFIO_EEH_PE_GET_STATE;
ioctl(container, VFIO_EEH_PE_OP, &pe_op);
/* Save device state using pci_save_state().
* EEH should be enabled on the specified device.
*/
....
/* Inject EEH error, which is expected to be caused by 32-bits
* config load.
*/
pe_op.op = VFIO_EEH_PE_INJECT_ERR;
pe_op.err.type = EEH_ERR_TYPE_32;
pe_op.err.func = EEH_ERR_FUNC_LD_CFG_ADDR;
pe_op.err.addr = 0ul;
pe_op.err.mask = 0ul;
ioctl(container, VFIO_EEH_PE_OP, &pe_op);
....
/* When 0xFF's returned from reading PCI config space or IO BARs
* of the PCI device. Check the PE's state to see if that has been
* frozen.
*/
ioctl(container, VFIO_EEH_PE_OP, &pe_op);
/* Waiting for pending PCI transactions to be completed and don't
* produce any more PCI traffic from/to the affected PE until
* recovery is finished.
*/
/* Enable IO for the affected PE and collect logs. Usually, the
* standard part of PCI config space, AER registers are dumped
* as logs for further analysis.
*/
pe_op.op = VFIO_EEH_PE_UNFREEZE_IO;
ioctl(container, VFIO_EEH_PE_OP, &pe_op);
/*
* Issue PE reset: hot or fundamental reset. Usually, hot reset
* is enough. However, the firmware of some PCI adapters would
* require fundamental reset.
*/
pe_op.op = VFIO_EEH_PE_RESET_HOT;
ioctl(container, VFIO_EEH_PE_OP, &pe_op);
pe_op.op = VFIO_EEH_PE_RESET_DEACTIVATE;
ioctl(container, VFIO_EEH_PE_OP, &pe_op);
/* Configure the PCI bridges for the affected PE */
pe_op.op = VFIO_EEH_PE_CONFIGURE;
ioctl(container, VFIO_EEH_PE_OP, &pe_op);
/* Restored state we saved at initialization time. pci_restore_state()
* is good enough as an example.
*/
/* Hopefully, error is recovered successfully. Now, you can resume to
* start PCI traffic to/from the affected PE.
*/
....
5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
VFIO_IOMMU_DISABLE and implements 2 new ioctls:
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
(which are unsupported in v1 IOMMU).
PPC64 paravirtualized guests generate a lot of map/unmap requests,
and the handling of those includes pinning/unpinning pages and updating
mm::locked_vm counter to make sure we do not exceed the rlimit.
The v2 IOMMU splits accounting and pinning into separate operations:
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
receive a user space address and size of the block to be pinned.
Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
be called with the exact address and size used for registering
the memory block. The userspace is not expected to call these often.
The ranges are stored in a linked list in a VFIO container.
- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
IOMMU table and do not do pinning; instead these check that the userspace
address is from pre-registered range.
This separation helps in optimizing DMA for guests.
6) sPAPR specification allows guests to have an additional DMA window(s) on
a PCI bus with a variable page size. Two ioctls have been added to support
this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
The platform has to support the functionality or error will be returned to
the userspace. The existing hardware supports up to 2 DMA windows, one is
2GB long, uses 4K pages and called "default 32bit window"; the other can
be as big as entire RAM, use different page size, it is optional - guests
create those in run-time if the guest driver supports 64bit DMA.
VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
a number of TCE table levels (if a TCE table is going to be big enough and
the kernel may not be able to allocate enough of physically contiguous
memory). It creates a new window in the available slot and returns the bus
address where the new window starts. Due to hardware limitation, the user
space cannot choose the location of DMA windows.
VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
and removes it.
-------------------------------------------------------------------------------
.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
initial implementation by Tom Lyon while as Cisco. We've since
outgrown the acronym, but it's catchy.
.. [2] "safe" also depends upon a device being "well behaved". It's
possible for multi-function devices to have backdoors between
functions and even for single function devices to have alternative
access to things like PCI config space through MMIO registers. To
guard against the former we can include additional precautions in the
IOMMU driver to group multi-function PCI devices together
(iommu=group_mf). The latter we can't prevent, but the IOMMU should
still provide isolation. For PCI, SR-IOV Virtual Functions are the
best indicator of "well behaved", as these are designed for
virtualization usage models.
.. [3] As always there are trade-offs to virtual machine device
assignment that are beyond the scope of VFIO. It's expected that
future IOMMU technologies will reduce some, but maybe not all, of
these trade-offs.
.. [4] In this case the device is below a PCI bridge, so transactions
from either function of the device are indistinguishable to the iommu::
-[0000:00]-+-1e.0-[06]--+-0d.0
\-0d.1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)

View File

@@ -0,0 +1,67 @@
====================================
Xilinx Zynq MPSoC EEMI Documentation
====================================
Xilinx Zynq MPSoC Firmware Interface
-------------------------------------
The zynqmp-firmware node describes the interface to platform firmware.
ZynqMP has an interface to communicate with secure firmware. Firmware
driver provides an interface to firmware APIs. Interface APIs can be
used by any driver to communicate with PMC(Platform Management Controller).
Embedded Energy Management Interface (EEMI)
----------------------------------------------
The embedded energy management interface is used to allow software
components running across different processing clusters on a chip or
device to communicate with a power management controller (PMC) on a
device to issue or respond to power management requests.
EEMI ops is a structure containing all eemi APIs supported by Zynq MPSoC.
The zynqmp-firmware driver maintain all EEMI APIs in zynqmp_eemi_ops
structure. Any driver who want to communicate with PMC using EEMI APIs
can call zynqmp_pm_get_eemi_ops().
Example of EEMI ops::
/* zynqmp-firmware driver maintain all EEMI APIs */
struct zynqmp_eemi_ops {
int (*get_api_version)(u32 *version);
int (*query_data)(struct zynqmp_pm_query_data qdata, u32 *out);
};
static const struct zynqmp_eemi_ops eemi_ops = {
.get_api_version = zynqmp_pm_get_api_version,
.query_data = zynqmp_pm_query_data,
};
Example of EEMI ops usage::
static const struct zynqmp_eemi_ops *eemi_ops;
u32 ret_payload[PAYLOAD_ARG_CNT];
int ret;
eemi_ops = zynqmp_pm_get_eemi_ops();
if (IS_ERR(eemi_ops))
return PTR_ERR(eemi_ops);
ret = eemi_ops->query_data(qdata, ret_payload);
IOCTL
------
IOCTL API is for device control and configuration. It is not a system
IOCTL but it is an EEMI API. This API can be used by master to control
any device specific configuration. IOCTL definitions can be platform
specific. This API also manage shared device configuration.
The following IOCTL IDs are valid for device control:
- IOCTL_SET_PLL_FRAC_MODE 8
- IOCTL_GET_PLL_FRAC_MODE 9
- IOCTL_SET_PLL_FRAC_DATA 10
- IOCTL_GET_PLL_FRAC_DATA 11
Refer EEMI API guide [0] for IOCTL specific parameters and other EEMI APIs.
References
----------
[0] Embedded Energy Management Interface (EEMI) API guide:
https://www.xilinx.com/support/documentation/user_guides/ug1200-eemi-api.pdf

View File

@@ -0,0 +1,16 @@
===========
Xilinx FPGA
===========
.. toctree::
:maxdepth: 1
eemi
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

View File

@@ -0,0 +1,379 @@
==========================================
Xillybus driver for generic FPGA interface
==========================================
:Author: Eli Billauer, Xillybus Ltd. (http://xillybus.com)
:Email: eli.billauer@gmail.com or as advertised on Xillybus' site.
.. Contents:
- Introduction
-- Background
-- Xillybus Overview
- Usage
-- User interface
-- Synchronization
-- Seekable pipes
- Internals
-- Source code organization
-- Pipe attributes
-- Host never reads from the FPGA
-- Channels, pipes, and the message channel
-- Data streaming
-- Data granularity
-- Probing
-- Buffer allocation
-- The "nonempty" message (supporting poll)
Introduction
============
Background
----------
An FPGA (Field Programmable Gate Array) is a piece of logic hardware, which
can be programmed to become virtually anything that is usually found as a
dedicated chipset: For instance, a display adapter, network interface card,
or even a processor with its peripherals. FPGAs are the LEGO of hardware:
Based upon certain building blocks, you make your own toys the way you like
them. It's usually pointless to reimplement something that is already
available on the market as a chipset, so FPGAs are mostly used when some
special functionality is needed, and the production volume is relatively low
(hence not justifying the development of an ASIC).
The challenge with FPGAs is that everything is implemented at a very low
level, even lower than assembly language. In order to allow FPGA designers to
focus on their specific project, and not reinvent the wheel over and over
again, pre-designed building blocks, IP cores, are often used. These are the
FPGA parallels of library functions. IP cores may implement certain
mathematical functions, a functional unit (e.g. a USB interface), an entire
processor (e.g. ARM) or anything that might come handy. Think of them as a
building block, with electrical wires dangling on the sides for connection to
other blocks.
One of the daunting tasks in FPGA design is communicating with a fullblown
operating system (actually, with the processor running it): Implementing the
low-level bus protocol and the somewhat higher-level interface with the host
(registers, interrupts, DMA etc.) is a project in itself. When the FPGA's
function is a well-known one (e.g. a video adapter card, or a NIC), it can
make sense to design the FPGA's interface logic specifically for the project.
A special driver is then written to present the FPGA as a well-known interface
to the kernel and/or user space. In that case, there is no reason to treat the
FPGA differently than any device on the bus.
It's however common that the desired data communication doesn't fit any well-
known peripheral function. Also, the effort of designing an elegant
abstraction for the data exchange is often considered too big. In those cases,
a quicker and possibly less elegant solution is sought: The driver is
effectively written as a user space program, leaving the kernel space part
with just elementary data transport. This still requires designing some
interface logic for the FPGA, and write a simple ad-hoc driver for the kernel.
Xillybus Overview
-----------------
Xillybus is an IP core and a Linux driver. Together, they form a kit for
elementary data transport between an FPGA and the host, providing pipe-like
data streams with a straightforward user interface. It's intended as a low-
effort solution for mixed FPGA-host projects, for which it makes sense to
have the project-specific part of the driver running in a user-space program.
Since the communication requirements may vary significantly from one FPGA
project to another (the number of data pipes needed in each direction and
their attributes), there isn't one specific chunk of logic being the Xillybus
IP core. Rather, the IP core is configured and built based upon a
specification given by its end user.
Xillybus presents independent data streams, which resemble pipes or TCP/IP
communication to the user. At the host side, a character device file is used
just like any pipe file. On the FPGA side, hardware FIFOs are used to stream
the data. This is contrary to a common method of communicating through fixed-
sized buffers (even though such buffers are used by Xillybus under the hood).
There may be more than a hundred of these streams on a single IP core, but
also no more than one, depending on the configuration.
In order to ease the deployment of the Xillybus IP core, it contains a simple
data structure which completely defines the core's configuration. The Linux
driver fetches this data structure during its initialization process, and sets
up the DMA buffers and character devices accordingly. As a result, a single
driver is used to work out of the box with any Xillybus IP core.
The data structure just mentioned should not be confused with PCI's
configuration space or the Flattened Device Tree.
Usage
=====
User interface
--------------
On the host, all interface with Xillybus is done through /dev/xillybus_*
device files, which are generated automatically as the drivers loads. The
names of these files depend on the IP core that is loaded in the FPGA (see
Probing below). To communicate with the FPGA, open the device file that
corresponds to the hardware FIFO you want to send data or receive data from,
and use plain write() or read() calls, just like with a regular pipe. In
particular, it makes perfect sense to go::
$ cat mydata > /dev/xillybus_thisfifo
$ cat /dev/xillybus_thatfifo > hisdata
possibly pressing CTRL-C as some stage, even though the xillybus_* pipes have
the capability to send an EOF (but may not use it).
The driver and hardware are designed to behave sensibly as pipes, including:
* Supporting non-blocking I/O (by setting O_NONBLOCK on open() ).
* Supporting poll() and select().
* Being bandwidth efficient under load (using DMA) but also handle small
pieces of data sent across (like TCP/IP) by autoflushing.
A device file can be read only, write only or bidirectional. Bidirectional
device files are treated like two independent pipes (except for sharing a
"channel" structure in the implementation code).
Synchronization
---------------
Xillybus pipes are configured (on the IP core) to be either synchronous or
asynchronous. For a synchronous pipe, write() returns successfully only after
some data has been submitted and acknowledged by the FPGA. This slows down
bulk data transfers, and is nearly impossible for use with streams that
require data at a constant rate: There is no data transmitted to the FPGA
between write() calls, in particular when the process loses the CPU.
When a pipe is configured asynchronous, write() returns if there was enough
room in the buffers to store any of the data in the buffers.
For FPGA to host pipes, asynchronous pipes allow data transfer from the FPGA
as soon as the respective device file is opened, regardless of if the data
has been requested by a read() call. On synchronous pipes, only the amount
of data requested by a read() call is transmitted.
In summary, for synchronous pipes, data between the host and FPGA is
transmitted only to satisfy the read() or write() call currently handled
by the driver, and those calls wait for the transmission to complete before
returning.
Note that the synchronization attribute has nothing to do with the possibility
that read() or write() completes less bytes than requested. There is a
separate configuration flag ("allowpartial") that determines whether such a
partial completion is allowed.
Seekable pipes
--------------
A synchronous pipe can be configured to have the stream's position exposed
to the user logic at the FPGA. Such a pipe is also seekable on the host API.
With this feature, a memory or register interface can be attached on the
FPGA side to the seekable stream. Reading or writing to a certain address in
the attached memory is done by seeking to the desired address, and calling
read() or write() as required.
Internals
=========
Source code organization
------------------------
The Xillybus driver consists of a core module, xillybus_core.c, and modules
that depend on the specific bus interface (xillybus_of.c and xillybus_pcie.c).
The bus specific modules are those probed when a suitable device is found by
the kernel. Since the DMA mapping and synchronization functions, which are bus
dependent by their nature, are used by the core module, a
xilly_endpoint_hardware structure is passed to the core module on
initialization. This structure is populated with pointers to wrapper functions
which execute the DMA-related operations on the bus.
Pipe attributes
---------------
Each pipe has a number of attributes which are set when the FPGA component
(IP core) is built. They are fetched from the IDT (the data structure which
defines the core's configuration, see Probing below) by xilly_setupchannels()
in xillybus_core.c as follows:
* is_writebuf: The pipe's direction. A non-zero value means it's an FPGA to
host pipe (the FPGA "writes").
* channelnum: The pipe's identification number in communication between the
host and FPGA.
* format: The underlying data width. See Data Granularity below.
* allowpartial: A non-zero value means that a read() or write() (whichever
applies) may return with less than the requested number of bytes. The common
choice is a non-zero value, to match standard UNIX behavior.
* synchronous: A non-zero value means that the pipe is synchronous. See
Synchronization above.
* bufsize: Each DMA buffer's size. Always a power of two.
* bufnum: The number of buffers allocated for this pipe. Always a power of two.
* exclusive_open: A non-zero value forces exclusive opening of the associated
device file. If the device file is bidirectional, and already opened only in
one direction, the opposite direction may be opened once.
* seekable: A non-zero value indicates that the pipe is seekable. See
Seekable pipes above.
* supports_nonempty: A non-zero value (which is typical) indicates that the
hardware will send the messages that are necessary to support select() and
poll() for this pipe.
Host never reads from the FPGA
------------------------------
Even though PCI Express is hotpluggable in general, a typical motherboard
doesn't expect a card to go away all of the sudden. But since the PCIe card
is based upon reprogrammable logic, a sudden disappearance from the bus is
quite likely as a result of an accidental reprogramming of the FPGA while the
host is up. In practice, nothing happens immediately in such a situation. But
if the host attempts to read from an address that is mapped to the PCI Express
device, that leads to an immediate freeze of the system on some motherboards,
even though the PCIe standard requires a graceful recovery.
In order to avoid these freezes, the Xillybus driver refrains completely from
reading from the device's register space. All communication from the FPGA to
the host is done through DMA. In particular, the Interrupt Service Routine
doesn't follow the common practice of checking a status register when it's
invoked. Rather, the FPGA prepares a small buffer which contains short
messages, which inform the host what the interrupt was about.
This mechanism is used on non-PCIe buses as well for the sake of uniformity.
Channels, pipes, and the message channel
----------------------------------------
Each of the (possibly bidirectional) pipes presented to the user is allocated
a data channel between the FPGA and the host. The distinction between channels
and pipes is necessary only because of channel 0, which is used for interrupt-
related messages from the FPGA, and has no pipe attached to it.
Data streaming
--------------
Even though a non-segmented data stream is presented to the user at both
sides, the implementation relies on a set of DMA buffers which is allocated
for each channel. For the sake of illustration, let's take the FPGA to host
direction: As data streams into the respective channel's interface in the
FPGA, the Xillybus IP core writes it to one of the DMA buffers. When the
buffer is full, the FPGA informs the host about that (appending a
XILLYMSG_OPCODE_RELEASEBUF message channel 0 and sending an interrupt if
necessary). The host responds by making the data available for reading through
the character device. When all data has been read, the host writes on the
the FPGA's buffer control register, allowing the buffer's overwriting. Flow
control mechanisms exist on both sides to prevent underflows and overflows.
This is not good enough for creating a TCP/IP-like stream: If the data flow
stops momentarily before a DMA buffer is filled, the intuitive expectation is
that the partial data in buffer will arrive anyhow, despite the buffer not
being completed. This is implemented by adding a field in the
XILLYMSG_OPCODE_RELEASEBUF message, through which the FPGA informs not just
which buffer is submitted, but how much data it contains.
But the FPGA will submit a partially filled buffer only if directed to do so
by the host. This situation occurs when the read() method has been blocking
for XILLY_RX_TIMEOUT jiffies (currently 10 ms), after which the host commands
the FPGA to submit a DMA buffer as soon as it can. This timeout mechanism
balances between bus bandwidth efficiency (preventing a lot of partially
filled buffers being sent) and a latency held fairly low for tails of data.
A similar setting is used in the host to FPGA direction. The handling of
partial DMA buffers is somewhat different, though. The user can tell the
driver to submit all data it has in the buffers to the FPGA, by issuing a
write() with the byte count set to zero. This is similar to a flush request,
but it doesn't block. There is also an autoflushing mechanism, which triggers
an equivalent flush roughly XILLY_RX_TIMEOUT jiffies after the last write().
This allows the user to be oblivious about the underlying buffering mechanism
and yet enjoy a stream-like interface.
Note that the issue of partial buffer flushing is irrelevant for pipes having
the "synchronous" attribute nonzero, since synchronous pipes don't allow data
to lay around in the DMA buffers between read() and write() anyhow.
Data granularity
----------------
The data arrives or is sent at the FPGA as 8, 16 or 32 bit wide words, as
configured by the "format" attribute. Whenever possible, the driver attempts
to hide this when the pipe is accessed differently from its natural alignment.
For example, reading single bytes from a pipe with 32 bit granularity works
with no issues. Writing single bytes to pipes with 16 or 32 bit granularity
will also work, but the driver can't send partially completed words to the
FPGA, so the transmission of up to one word may be held until it's fully
occupied with user data.
This somewhat complicates the handling of host to FPGA streams, because
when a buffer is flushed, it may contain up to 3 bytes don't form a word in
the FPGA, and hence can't be sent. To prevent loss of data, these leftover
bytes need to be moved to the next buffer. The parts in xillybus_core.c
that mention "leftovers" in some way are related to this complication.
Probing
-------
As mentioned earlier, the number of pipes that are created when the driver
loads and their attributes depend on the Xillybus IP core in the FPGA. During
the driver's initialization, a blob containing configuration info, the
Interface Description Table (IDT), is sent from the FPGA to the host. The
bootstrap process is done in three phases:
1. Acquire the length of the IDT, so a buffer can be allocated for it. This
is done by sending a quiesce command to the device, since the acknowledge
for this command contains the IDT's buffer length.
2. Acquire the IDT itself.
3. Create the interfaces according to the IDT.
Buffer allocation
-----------------
In order to simplify the logic that prevents illegal boundary crossings of
PCIe packets, the following rule applies: If a buffer is smaller than 4kB,
it must not cross a 4kB boundary. Otherwise, it must be 4kB aligned. The
xilly_setupchannels() functions allocates these buffers by requesting whole
pages from the kernel, and diving them into DMA buffers as necessary. Since
all buffers' sizes are powers of two, it's possible to pack any set of such
buffers, with a maximal waste of one page of memory.
All buffers are allocated when the driver is loaded. This is necessary,
since large continuous physical memory segments are sometimes requested,
which are more likely to be available when the system is freshly booted.
The allocation of buffer memory takes place in the same order they appear in
the IDT. The driver relies on a rule that the pipes are sorted with decreasing
buffer size in the IDT. If a requested buffer is larger or equal to a page,
the necessary number of pages is requested from the kernel, and these are
used for this buffer. If the requested buffer is smaller than a page, one
single page is requested from the kernel, and that page is partially used.
Or, if there already is a partially used page at hand, the buffer is packed
into that page. It can be shown that all pages requested from the kernel
(except possibly for the last) are 100% utilized this way.
The "nonempty" message (supporting poll)
----------------------------------------
In order to support the "poll" method (and hence select() ), there is a small
catch regarding the FPGA to host direction: The FPGA may have filled a DMA
buffer with some data, but not submitted that buffer. If the host waited for
the buffer's submission by the FPGA, there would be a possibility that the
FPGA side has sent data, but a select() call would still block, because the
host has not received any notification about this. This is solved with
XILLYMSG_OPCODE_NONEMPTY messages sent by the FPGA when a channel goes from
completely empty to containing some data.
These messages are used only to support poll() and select(). The IP core can
be configured not to send them for a slight reduction of bandwidth.

View File

@@ -0,0 +1,104 @@
========================================
Writing Device Drivers for Zorro Devices
========================================
:Author: Written by Geert Uytterhoeven <geert@linux-m68k.org>
:Last revised: September 5, 2003
Introduction
------------
The Zorro bus is the bus used in the Amiga family of computers. Thanks to
AutoConfig(tm), it's 100% Plug-and-Play.
There are two types of Zorro buses, Zorro II and Zorro III:
- The Zorro II address space is 24-bit and lies within the first 16 MB of the
Amiga's address map.
- Zorro III is a 32-bit extension of Zorro II, which is backwards compatible
with Zorro II. The Zorro III address space lies outside the first 16 MB.
Probing for Zorro Devices
-------------------------
Zorro devices are found by calling ``zorro_find_device()``, which returns a
pointer to the ``next`` Zorro device with the specified Zorro ID. A probe loop
for the board with Zorro ID ``ZORRO_PROD_xxx`` looks like::
struct zorro_dev *z = NULL;
while ((z = zorro_find_device(ZORRO_PROD_xxx, z))) {
if (!zorro_request_region(z->resource.start+MY_START, MY_SIZE,
"My explanation"))
...
}
``ZORRO_WILDCARD`` acts as a wildcard and finds any Zorro device. If your driver
supports different types of boards, you can use a construct like::
struct zorro_dev *z = NULL;
while ((z = zorro_find_device(ZORRO_WILDCARD, z))) {
if (z->id != ZORRO_PROD_xxx1 && z->id != ZORRO_PROD_xxx2 && ...)
continue;
if (!zorro_request_region(z->resource.start+MY_START, MY_SIZE,
"My explanation"))
...
}
Zorro Resources
---------------
Before you can access a Zorro device's registers, you have to make sure it's
not yet in use. This is done using the I/O memory space resource management
functions::
request_mem_region()
release_mem_region()
Shortcuts to claim the whole device's address space are provided as well::
zorro_request_device
zorro_release_device
Accessing the Zorro Address Space
---------------------------------
The address regions in the Zorro device resources are Zorro bus address
regions. Due to the identity bus-physical address mapping on the Zorro bus,
they are CPU physical addresses as well.
The treatment of these regions depends on the type of Zorro space:
- Zorro II address space is always mapped and does not have to be mapped
explicitly using z_ioremap().
Conversion from bus/physical Zorro II addresses to kernel virtual addresses
and vice versa is done using::
virt_addr = ZTWO_VADDR(bus_addr);
bus_addr = ZTWO_PADDR(virt_addr);
- Zorro III address space must be mapped explicitly using z_ioremap() first
before it can be accessed::
virt_addr = z_ioremap(bus_addr, size);
...
z_iounmap(virt_addr);
References
----------
#. linux/include/linux/zorro.h
#. linux/include/uapi/linux/zorro.h
#. linux/include/uapi/linux/zorro_ids.h
#. linux/arch/m68k/include/asm/zorro.h
#. linux/drivers/zorro
#. /proc/bus/zorro