- =======================================
- Oracle Data Analytics Accelerator (DAX)
- =======================================
- DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
- (DAX2) processor chips, and has direct access to the CPU's L3 caches
- as well as physical memory. It can perform several operations on data
- streams with various input and output formats. A driver provides a
- transport mechanism and has limited knowledge of the various opcodes
- and data formats. A user space library provides high level services
- and translates these into low level commands which are then passed
- into the driver and subsequently the Hypervisor and the coprocessor.
- The library is the recommended way for applications to use the
- coprocessor, and the driver interface is not intended for general use.
- This document describes the general flow of the driver, its
- structures, and its programmatic interface. It also provides example
- code sufficient to write user or kernel applications that use DAX
- functionality.
- The user library is open source and available at:
- https://oss.oracle.com/git/gitweb.cgi?p=libdax.git
- The Hypervisor interface to the coprocessor is described in detail in
- the accompanying document, dax-hv-api.txt, which is a plain text
- excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
- Specification" version 3.0.20+15, dated 2017-09-25.
- High Level Overview
- ===================
- A coprocessor request is described by a Command Control Block
- (CCB). The CCB contains an opcode and various parameters. The opcode
- specifies what operation is to be done, and the parameters specify
- options, flags, sizes, and addresses. The CCB (or an array of CCBs)
- is passed to the Hypervisor, which handles queueing and scheduling of
- requests to the available coprocessor execution units. A status code
- returned indicates if the request was submitted successfully or if
- there was an error. One of the addresses given in each CCB is a
- pointer to a "completion area", which is a 128 byte memory block that
- is written by the coprocessor to provide execution status. No
- interrupt is generated upon completion; the completion area must be
- polled by software to find out when a transaction has finished, but
- the M7 and later processors provide a mechanism to pause the virtual
- processor until the completion status has been updated by the
- coprocessor. This is done using the monitored load and mwait
- instructions, which are described in more detail later. The DAX
- coprocessor was designed so that after a request is submitted, the
- kernel is no longer involved in the processing of it. The polling is
- done at the user level, which results in almost zero latency between
- completion of a request and resumption of execution of the requesting
- thread.
- Addressing Memory
- =================
- The kernel does not have access to physical memory in the Sun4v
- architecture, as there is an additional level of memory virtualization
- present. This intermediate level is called "real" memory, and the
- kernel treats this as if it were physical. The Hypervisor handles the
- translations between real memory and physical so that each logical
- domain (LDOM) can have a partition of physical memory that is isolated
- from that of other LDOMs. When the kernel sets up a virtual mapping,
- it specifies a virtual address and the real address to which it should
- be mapped.
- The DAX coprocessor can only operate on physical memory, so before a
- request can be fed to the coprocessor, all the addresses in a CCB must
- be converted into physical addresses. The kernel cannot do this since
- it has no visibility into physical addresses. So a CCB may contain
- either the virtual or real addresses of the buffers or a combination
- of them. An "address type" field is available for each address that
- may be given in the CCB. In all cases, the Hypervisor will translate
- all the addresses to physical before dispatching to hardware. Address
- translations are performed using the context of the process initiating
- the request.
- The Driver API
- ==============
- An application makes requests to the driver via the write() system
- call, and gets results (if any) via read(). The completion areas are
- made accessible via mmap(), and are read-only for the application.
- The request may either be an immediate command or an array of CCBs to
- be submitted to the hardware.
- Each open instance of the device is exclusive to the thread that
- opened it, and must be used by that thread for all subsequent
- operations. The driver open function creates a new context for the
- thread and initializes it for use. This context contains pointers and
- values used internally by the driver to keep track of submitted
- requests. The completion area buffer is also allocated, and this is
- large enough to contain the completion areas for many concurrent
- requests. When the device is closed, any outstanding transactions are
- flushed and the context is cleaned up.
- On a DAX1 system (M7), the device will be called "oradax1", while on a
- DAX2 system (M8) it will be "oradax2". If an application requires one
- or the other, it should simply attempt to open the appropriate
- device. Only one of the devices will exist on any given system, so the
- name can be used to determine what the platform supports.
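- A minimal sketch of this probing logic follows. The helper name
- dax_probe() and its list-of-paths signature are illustrative
- assumptions and are not part of the driver or of libdax; only the
- device names come from this document.

```c
#include <fcntl.h>
#include <stddef.h>

/*
 * Hypothetical helper: try each path in order with O_RDWR.
 * Returns the index of the first path that opens and stores the
 * descriptor in *fd_out, or returns -1 (and stores -1) if none opens.
 */
static int dax_probe(const char *const paths[], size_t npaths, int *fd_out)
{
	for (size_t i = 0; i < npaths; i++) {
		int fd = open(paths[i], O_RDWR);

		if (fd >= 0) {
			*fd_out = fd;
			return (int)i;
		}
	}
	*fd_out = -1;
	return -1;
}
```

- Called with the list { "/dev/oradax1", "/dev/oradax2" }, a return
- value of 0 would indicate DAX1 (M7), 1 would indicate DAX2 (M8), and
- -1 would indicate that no DAX device could be opened.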
- The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
- all of these, success is indicated by a return value from write()
- equal to the number of bytes given in the call. Otherwise -1 is
- returned and errno is set.
- CCB_DEQUEUE
- -----------
- Tells the driver to clean up resources associated with past
- requests. Since no interrupt is generated upon the completion of a
- request, the driver must be told when it may reclaim resources. No
- further status information is returned, so the user should not
- subsequently call read().
- CCB_KILL
- --------
- Kills a CCB during execution. The CCB is guaranteed to not continue
- executing once this call returns successfully. On success, read() must
- be called to retrieve the result of the action.
- CCB_INFO
- --------
- Retrieves information about a currently executing CCB. Note that some
- Hypervisors might return 'notfound' when the CCB is in 'inprogress'
- state. To ensure a CCB in the 'notfound' state will never be executed,
- CCB_KILL must be invoked on that CCB. Upon success, read() must be
- called to retrieve the details of the action.
- Submission of an array of CCBs for execution
- ---------------------------------------------
- A write() whose length is a multiple of the CCB size is treated as a
- submit operation. The file offset is treated as the index of the
- completion area to use, and may be set via lseek() or using the
- pwrite() system call. If -1 is returned then errno is set to indicate
- the error. Otherwise, the return value is the length of the array that
- was actually accepted by the coprocessor. If the accepted length is
- equal to the requested length, then the submission was completely
- successful and there is no further status needed; hence, the user
- should not subsequently call read(). Partial acceptance of the CCB
- array is indicated by a return value less than the requested length,
- and read() must be called to retrieve further status information. The
- status will reflect the error caused by the first CCB that was not
- accepted, and status_data will provide additional data in some cases.
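- The three possible outcomes of a submit can be sketched as a small
- classifier. The enum and function names below are hypothetical, and
- the sketch assumes the return value of write() counts bytes (as the
- 64-byte pwrite() in the Example Code section suggests).

```c
#include <stddef.h>
#include <sys/types.h>

#define CCB_BYTES 64	/* eight 64-bit words per CCB */

enum submit_outcome {
	SUBMIT_ERROR,		/* write() returned -1; consult errno */
	SUBMIT_PARTIAL,		/* some CCBs rejected; call read() for status */
	SUBMIT_COMPLETE		/* all CCBs accepted; do not call read() */
};

/*
 * Classify the return value of a submit write().
 * ret:           value returned by write() or pwrite()
 * requested:     number of bytes passed to the call
 * ccbs_accepted: set to the number of whole CCBs accepted
 */
static enum submit_outcome classify_submit(ssize_t ret, size_t requested,
					   size_t *ccbs_accepted)
{
	*ccbs_accepted = 0;
	if (ret < 0)
		return SUBMIT_ERROR;
	*ccbs_accepted = (size_t)ret / CCB_BYTES;
	return ((size_t)ret == requested) ? SUBMIT_COMPLETE : SUBMIT_PARTIAL;
}
```

- Only SUBMIT_PARTIAL calls for a subsequent read(); the completion
- areas of the accepted CCBs are polled as usual.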
- MMAP
- ----
- The mmap() function provides access to the completion area allocated
- in the driver. Note that the completion area is not writeable by the
- user process, and the mmap call must not specify PROT_WRITE.
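- Since each completion area is 128 bytes and the submit offset is
- treated as a completion area index, locating the area for a given
- request is simple pointer arithmetic. The sketch below assumes the
- areas are packed back to back in the mapped buffer; the helper name is
- illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

#define COMPLETION_AREA_BYTES 128

/*
 * Return a pointer to completion area number 'index' inside the
 * read-only buffer obtained from mmap(). The first byte of the
 * returned area is the command status.
 */
static const volatile uint8_t *completion_area_at(const void *base,
						  size_t index)
{
	return (const volatile uint8_t *)base + index * COMPLETION_AREA_BYTES;
}
```

- The pointer is const and volatile because the application may only
- read the area and the coprocessor updates it asynchronously.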
- Completion of a Request
- =======================
- The first byte in each completion area is the command status which is
- updated by the coprocessor hardware. Software may take advantage of
- new M7/M8 processor capabilities to efficiently poll this status byte.
- First, a "monitored load" is achieved via a Load from Alternate Space
- (ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
- "monitored wait" is achieved via the mwait instruction (a write to
- %asr28). This instruction is like pause in that it suspends execution
- of the virtual processor for the given number of nanoseconds, but in
- addition will terminate early when one of several events occurs. If the
- block of data containing the monitored location is modified, then the
- mwait terminates. This causes software to resume execution immediately
- (without a context switch or kernel to user transition) after a
- transaction completes. Thus the latency between transaction completion
- and resumption of execution may be just a few nanoseconds.
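- For testing the surrounding logic on a machine without DAX, the
- polling loop can be approximated portably. The sketch below is a
- stand-in only: it uses an ordinary volatile load and a bounded retry
- count in place of the monitored load and mwait, so it has none of
- their power or latency benefits. The function name and the retry
- limit are assumptions.

```c
#include <stdint.h>

/*
 * Poll the status byte (byte 0 of a completion area) up to max_polls
 * times. Returns the first non-zero status seen, or 0 if the command
 * was still in progress after max_polls reads.
 */
static uint8_t poll_status(const volatile uint8_t *completion_area,
			   unsigned long max_polls)
{
	for (unsigned long i = 0; i < max_polls; i++) {
		uint8_t status = completion_area[0];

		if (status)	/* 0 indicates command in progress */
			return status;
	}
	return 0;
}
```

- On M7/M8 hardware the plain load would be replaced by the monitored
- load and the loop body would end with mwait, as shown in the Example
- Code section.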
- Application Life Cycle of a DAX Submission
- ==========================================
- - open dax device
- - call mmap() to get the completion area address
- - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
- - submit CCB via write() or pwrite()
- - go into a loop executing monitored load + monitored wait and
-   terminate when the command status indicates the request is complete
-   (CCB_KILL or CCB_INFO may be used any time as necessary)
- - perform a CCB_DEQUEUE
- - call munmap() for completion area
- - close the dax device
- Memory Constraints
- ==================
- The DAX hardware operates only on physical addresses. Therefore, it is
- not aware of virtual memory mappings and the discontiguities that may
- exist in the physical memory that a virtual buffer maps to. There is
- no I/O TLB or any scatter/gather mechanism. All buffers, whether input
- or output, must reside in a physically contiguous region of memory.
- The Hypervisor translates all addresses within a CCB to physical
- before handing off the CCB to DAX. The Hypervisor determines the
- virtual page size for each virtual address given, and uses this to
- program a size limit for each address. This prevents the coprocessor
- from reading or writing beyond the bound of the virtual page, even
- though it is accessing physical memory directly. A simpler way of
- saying this is that a DAX operation will never "cross" a virtual page
- boundary. If an 8k virtual page is used, then the data is strictly
- limited to 8k. If a user's buffer is larger than 8k, then a larger
- page size must be used, or the transaction size will be truncated to
- 8k.
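- The page-boundary rule can be expressed as a short calculation: the
- usable length of a buffer is capped by the distance from its start
- address to the end of the page containing it. The helper names below
- are hypothetical, and the page size is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes from addr to the end of its page (page_size is a power of two). */
static size_t bytes_to_page_end(uintptr_t addr, size_t page_size)
{
	return page_size - (size_t)(addr & (page_size - 1));
}

/*
 * Length that a DAX operation could cover for a buffer of 'len' bytes
 * starting at 'addr' without crossing the boundary of its page.
 */
static size_t dax_usable_len(uintptr_t addr, size_t len, size_t page_size)
{
	size_t room = bytes_to_page_end(addr, page_size);

	return len < room ? len : room;
}
```

- For example, a 16k buffer starting on an 8k page boundary yields a
- usable length of 8k, and a buffer starting 100 bytes into an 8k page
- can cover at most 8092 bytes.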
- Huge pages. A user may allocate huge pages using standard interfaces.
- Memory buffers residing on huge pages may be used to achieve much
- larger DAX transaction sizes, but the rules must still be followed,
- and no transaction will cross a page boundary, even a huge page. A
- major caveat is that Linux on SPARC presents 8 MB as one of the huge
- page sizes. SPARC does not actually provide an 8 MB hardware page size,
- and this size is synthesized by pasting together two 4 MB pages. The
- reasons for this are historical, and it creates an issue because only
- half of this 8 MB page can actually be used for any given buffer in a
- DAX request, and it must be either the first half or the second half;
- it cannot be a 4 MB chunk in the middle, since that crosses a
- (hardware) page boundary. Note that this entire issue may be hidden by
- higher level libraries.
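- A buffer carved out of such a synthesized huge page can be checked
- against the underlying 4 MB hardware pages before it is used. This is
- a sketch with a hypothetical helper name; it only tests whether the
- first and last byte of the buffer fall in the same 4 MB-aligned
- region.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HW_PAGE_4MB ((uintptr_t)4 << 20)

/*
 * Return true if [addr, addr + len) lies entirely within one
 * 4 MB-aligned hardware page, false otherwise.
 */
static bool fits_in_4mb_page(uintptr_t addr, size_t len)
{
	if (len == 0 || len > HW_PAGE_4MB)
		return false;
	return (addr / HW_PAGE_4MB) == ((addr + len - 1) / HW_PAGE_4MB);
}
```

- With an 8 MB-aligned base address, the first and second 4 MB halves
- both pass this check, while a 4 MB buffer starting 2 MB into the page
- does not.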
- CCB Structure
- -------------
- A CCB is an array of eight 64-bit words. Several of these words provide
- command opcodes, parameters, flags, etc., and the rest are addresses
- for the completion area, output buffer, and various inputs::
-     struct ccb {
-         u64 control;
-         u64 completion;
-         u64 input0;
-         u64 access;
-         u64 input1;
-         u64 op_data;
-         u64 output;
-         u64 table;
-     };
- See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
- each of these fields, and see dax-hv-api.txt for a complete description
- of the Hypervisor API available to the guest OS (i.e., the Linux kernel).
- The first word (control) is examined by the driver for the following:
- - CCB version, which must be consistent with hardware version
- - Opcode, which must be one of the documented allowable commands
- - Address types, which must be set to "virtual" for all the addresses
-   given by the user, thereby ensuring that the application can
-   only access memory that it owns
- Example Code
- ============
- The DAX is accessible to both user and kernel code. The kernel code
- can make hypercalls directly while the user code must use wrappers
- provided by the driver. The setup of the CCB is nearly identical for
- both; the only difference is in preparation of the completion area. An
- example of user code is given now, with kernel code afterwards.
- In order to program using the driver API, the file
- arch/sparc/include/uapi/asm/oradax.h must be included.
- First, the proper device must be opened. For M7 it will be
- /dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
- procedure is to attempt to open both, as only one will succeed::
-     fd = open("/dev/oradax1", O_RDWR);
-     if (fd < 0)
-         fd = open("/dev/oradax2", O_RDWR);
-     if (fd < 0) {
-         /* No DAX found */
-     }
- Next, the completion area must be mapped::
-     completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
- All input and output buffers must be fully contained in one hardware
- page, since as explained above, the DAX is strictly constrained by
- virtual page boundaries. In addition, the output buffer must be
- 64-byte aligned and its size must be a multiple of 64 bytes because
- the coprocessor writes in units of cache lines.
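- One way to satisfy the 64-byte alignment and size rule for the output
- buffer is sketched below using C11 aligned_alloc(). The helper names
- are hypothetical. Note that this sketch addresses alignment and size
- only; it does not by itself guarantee that the buffer lies within a
- single hardware page.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DAX_OUT_ALIGN 64	/* output is written in 64-byte cache lines */

/* Round n up to the next multiple of 64. */
static size_t round_up_64(size_t n)
{
	return (n + 63) & ~(size_t)63;
}

/*
 * Allocate a zeroed, 64-byte aligned output buffer of at least
 * min_bytes. The actual size (a multiple of 64) is stored in
 * *alloc_bytes. Returns NULL on failure; release with free().
 */
static void *alloc_output(size_t min_bytes, size_t *alloc_bytes)
{
	size_t n = round_up_64(min_bytes ? min_bytes : 1);
	void *p = aligned_alloc(DAX_OUT_ALIGN, n);

	if (!p)
		return NULL;
	memset(p, 0, n);
	*alloc_bytes = n;
	return p;
}
```

- A request for 100 bytes therefore yields a 128-byte buffer whose
- address is a multiple of 64.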
- This example demonstrates the DAX Scan command, which takes as input a
- vector and a match value, and produces a bitmap as the output. For
- each input element that matches the value, the corresponding bit is
- set in the output.
- In this example, the input vector consists of a series of single bits,
- and the match value is 0. So each 0 bit in the input will produce a 1
- in the output, and vice versa; the output bitmap is therefore the
- inverse of the input bitmap.
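- For checking results, the expected output of this particular scan can
- be computed in software. The function below is a reference sketch
- with a hypothetical name, not part of any DAX library; it handles
- whole bytes only and does not model how the hardware treats unused
- trailing bits.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Software reference for "scan 1-bit elements for the value 0":
 * every input bit that is 0 yields a 1 in the output and vice versa,
 * which for whole bytes is a bitwise complement.
 */
static void scan_zero_bits_reference(const uint8_t *in, uint8_t *out,
				     size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		out[i] = (uint8_t)~in[i];
}
```

- Comparing this against the DAX output buffer with memcmp() is a
- simple sanity check for the example.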
- For details of all the parameters and bits used in this CCB, please
- refer to section 36.2.1.3 of the DAX Hypervisor API document, which
- describes the Scan command in detail::
-     ccb->control =       /* Table 36.1, CCB Header Format */
-           (2L << 48)     /* command = Scan Value */
-         | (3L << 40)     /* output address type = primary virtual */
-         | (3L << 34)     /* primary input address type = primary virtual */
-                          /* Section 36.2.1, Query CCB Command Formats */
-         | (1 << 28)      /* 36.2.1.1.1 primary input format = fixed width bit packed */
-         | (0 << 23)      /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
-         | (8 << 10)      /* 36.2.1.1.6 output format = bit vector */
-         | (0 <<  5)      /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
-         | (31 << 0);     /* 36.2.1.3 Disable second scan criteria */
-     ccb->completion = 0;    /* Completion area address, to be filled in by driver */
-     ccb->input0 = (unsigned long) input; /* primary input address */
-     ccb->access =        /* Section 36.2.1.2, Data Access Control */
-           (2 << 24)      /* Primary input length format = bits */
-         | (nbits - 1);   /* number of bits in primary input stream, minus 1 */
-     ccb->input1 = 0;     /* secondary input address, unused */
-     ccb->op_data = 0;    /* scan criteria (value to be matched) */
-     ccb->output = (unsigned long) output; /* output address */
-     ccb->table = 0;      /* table address, unused */
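- The same setup can be wrapped in a function for user-space testing.
- This sketch uses uint64_t in place of the kernel-style u64, adds a
- compile-time check that the structure is exactly 64 bytes, and reuses
- the bit positions from the listing above unchanged. The function name
- is hypothetical, nbits must be at least 1, and no range check against
- the width of the length field is attempted.

```c
#include <stdint.h>

/* Eight 64-bit words, as in the CCB Structure section. */
struct ccb {
	uint64_t control;
	uint64_t completion;
	uint64_t input0;
	uint64_t access;
	uint64_t input1;
	uint64_t op_data;
	uint64_t output;
	uint64_t table;
};

_Static_assert(sizeof(struct ccb) == 64, "a CCB must be exactly 64 bytes");

/* Fill in a Scan Value CCB matching 1-bit elements against 0. */
static void fill_scan_ccb(struct ccb *ccb, const void *input, void *output,
			  uint64_t nbits)
{
	ccb->control = ((uint64_t)2 << 48)	/* command = Scan Value */
		     | ((uint64_t)3 << 40)	/* output address: primary virtual */
		     | ((uint64_t)3 << 34)	/* input address: primary virtual */
		     | ((uint64_t)1 << 28)	/* input format: fixed width bit packed */
		     | ((uint64_t)0 << 23)	/* input element size: 1 bit */
		     | ((uint64_t)8 << 10)	/* output format: bit vector */
		     | ((uint64_t)0 << 5)	/* first scan criteria size */
		     | (uint64_t)31;		/* disable second scan criteria */
	ccb->completion = 0;			/* filled in by the driver */
	ccb->input0 = (uint64_t)(uintptr_t)input;
	ccb->access = ((uint64_t)2 << 24)	/* input length format = bits */
		    | (nbits - 1);		/* number of bits, minus 1 */
	ccb->input1 = 0;			/* unused */
	ccb->op_data = 0;			/* value to be matched */
	ccb->output = (uint64_t)(uintptr_t)output;
	ccb->table = 0;				/* unused */
}
```

- With these bit positions the control word evaluates to
- 0x0002030C1000201F, which can be handy when inspecting a CCB in a
- debugger.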
- The CCB submission is a write() or pwrite() system call to the
- driver. If the call fails, then a read() must be used to retrieve the
- status::
-     if (pwrite(fd, ccb, 64, 0) != 64) {
-         struct ccb_exec_result status;
-         read(fd, &status, sizeof(status));
-         /* bail out */
-     }
- After a successful submission of the CCB, the completion area may be
- polled to determine when the DAX is finished. Detailed information on
- the contents of the completion area can be found in section 36.2.2 of
- the DAX HV API document::
-     while (1) {
-         /* Monitored Load */
-         __asm__ __volatile__("lduba [%1] 0x84, %0\n"
-                              : "=r" (status)
-                              : "r" (completion_area));
-         if (status) /* 0 indicates command in progress */
-             break;
-         /* MWAIT */
-         __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
-     }
- A completion area status of 1 indicates successful completion of the
- CCB and validity of the output bitmap, which may be used immediately.
- All other non-zero values indicate error conditions which are
- described in section 36.2.2::
-     if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
-         /* completion_area[0] contains the completion status */
-         /* completion_area[1] contains an error code, see 36.2.2 */
-     }
- After the completion area has been processed, the driver must be
- notified that it can release any resources associated with the
- request. This is done via the dequeue operation::
-     struct dax_command cmd;
-     cmd.command = CCB_DEQUEUE;
-     if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
-         /* bail out */
-     }
- Finally, normal program cleanup should be done, i.e., unmapping the
- completion area, closing the dax device, freeing memory, etc.
- Kernel example
- --------------
- The only difference in using the DAX in kernel code is the treatment
- of the completion area. Unlike user applications which mmap the
- completion area allocated by the driver, kernel code must allocate its
- own memory to use for the completion area, and this address and its
- type must be given in the CCB::
-     ccb->control |=      /* Table 36.1, CCB Header Format */
-         (3L << 32);      /* completion area address type = primary virtual */
-     ccb->completion = (unsigned long) completion_area; /* Completion area address */
- The dax submit hypercall is made directly. The flags used in the
- ccb_submit call are documented in the DAX HV API in section 36.3.1::
-     #include <asm/hypervisor.h>
-     hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
-                              HV_CCB_QUERY_CMD |
-                              HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
-                              HV_CCB_VA_PRIVILEGED,
-                              0, &bytes_accepted, &status_data);
-     if (hv_rv != HV_EOK) {
-         /* hv_rv is an error code, status_data contains */
-         /* potential additional status, see 36.3.1.1 */
-     }
- After the submission, the completion area polling code is identical to
- that in user land::
-     while (1) {
-         /* Monitored Load */
-         __asm__ __volatile__("lduba [%1] 0x84, %0\n"
-                              : "=r" (status)
-                              : "r" (completion_area));
-         if (status) /* 0 indicates command in progress */
-             break;
-         /* MWAIT */
-         __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
-     }
-     if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
-         /* completion_area[0] contains the completion status */
-         /* completion_area[1] contains an error code, see 36.2.2 */
-     }
- The output bitmap is ready for consumption immediately after the
- completion status indicates success.
- Excerpt from UltraSPARC Virtual Machine Specification
- =====================================================
- .. include:: dax-hv-api.txt
- :literal: