doc: packet_mmap: update doc to implementation status

This improves the packet_mmap.txt document in the following ways:

 * Add initial information about different TPACKET versions
 * Add initial information about packet fanout
 * Add pointer to BPF document (since this also could be of interest)
 * 'Fix' minor, rather cosmetic things

Information partially taken from related commit messages.

Reported-by: Ronny Meeus <ronny.meeus@gmail.com>
Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Cc: Ulisses Alonso Camaró <uaca@alumni.uv.es>
Cc: Johann Baudy <johann.baudy@gnu-log.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Committed by: David S. Miller
Parent: 56277f40d7
Commit: d1ee40f960
@@ -3,9 +3,9 @@
|
|||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
This file documents the mmap() facility available with the PACKET
|
This file documents the mmap() facility available with the PACKET
|
||||||
socket interface on 2.4 and 2.6 kernels. This type of sockets is used for
|
socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
|
||||||
capture network traffic with utilities like tcpdump or any other that needs
|
i) capture network traffic with utilities like tcpdump, ii) transmit network
|
||||||
raw access to network interface.
|
traffic, or any other that needs raw access to network interface.
|
||||||
|
|
||||||
You can find the latest version of this document at:
|
You can find the latest version of this document at:
|
||||||
http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
|
http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
|
||||||
@@ -21,19 +21,18 @@ Please send your comments to
|
|||||||
+ Why use PACKET_MMAP
|
+ Why use PACKET_MMAP
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very
|
In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
|
||||||
inefficient. It uses very limited buffers and requires one system call
|
inefficient. It uses very limited buffers and requires one system call to
|
||||||
to capture each packet, it requires two if you want to get packet's
|
capture each packet, it requires two if you want to get packet's timestamp
|
||||||
timestamp (like libpcap always does).
|
(like libpcap always does).
|
||||||
|
|
||||||
In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
|
In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
|
||||||
configurable circular buffer mapped in user space that can be used to either
|
configurable circular buffer mapped in user space that can be used to either
|
||||||
send or receive packets. This way reading packets just needs to wait for them,
|
send or receive packets. This way reading packets just needs to wait for them,
|
||||||
most of the time there is no need to issue a single system call. Concerning
|
most of the time there is no need to issue a single system call. Concerning
|
||||||
transmission, multiple packets can be sent through one system call to get the
|
transmission, multiple packets can be sent through one system call to get the
|
||||||
highest bandwidth.
|
highest bandwidth. By using a shared buffer between the kernel and the user
|
||||||
By using a shared buffer between the kernel and the user also has the benefit
|
also has the benefit of minimizing packet copies.
|
||||||
of minimizing packet copies.
|
|
||||||
|
|
||||||
It's fine to use PACKET_MMAP to improve the performance of the capture and
|
It's fine to use PACKET_MMAP to improve the performance of the capture and
|
||||||
transmission process, but it isn't everything. At least, if you are capturing
|
transmission process, but it isn't everything. At least, if you are capturing
|
||||||
@@ -41,7 +40,8 @@ at high speeds (this is relative to the cpu speed), you should check if the
|
|||||||
device driver of your network interface card supports some sort of interrupt
|
device driver of your network interface card supports some sort of interrupt
|
||||||
load mitigation or (even better) if it supports NAPI, also make sure it is
|
load mitigation or (even better) if it supports NAPI, also make sure it is
|
||||||
enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
|
enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
|
||||||
supported by devices of your network.
|
supported by devices of your network. CPU IRQ pinning of your network interface
|
||||||
|
card can also be an advantage.
|
||||||
|
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
+ How to use mmap() to improve capture process
|
+ How to use mmap() to improve capture process
|
||||||
@@ -87,9 +87,7 @@ the following process:
|
|||||||
socket creation and destruction is straight forward, and is done
|
socket creation and destruction is straight forward, and is done
|
||||||
the same way with or without PACKET_MMAP:
|
the same way with or without PACKET_MMAP:
|
||||||
|
|
||||||
int fd;
|
int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
|
||||||
|
|
||||||
fd= socket(PF_PACKET, mode, htons(ETH_P_ALL))
|
|
||||||
|
|
||||||
where mode is SOCK_RAW for the raw interface were link level
|
where mode is SOCK_RAW for the raw interface were link level
|
||||||
information can be captured or SOCK_DGRAM for the cooked
|
information can be captured or SOCK_DGRAM for the cooked
|
||||||
@@ -180,7 +178,6 @@ and the PACKET_TX_HAS_OFF option.
|
|||||||
+ PACKET_MMAP settings
|
+ PACKET_MMAP settings
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
To setup PACKET_MMAP from user level code is done with a call like
|
To setup PACKET_MMAP from user level code is done with a call like
|
||||||
|
|
||||||
- Capture process
|
- Capture process
|
||||||
@@ -214,7 +211,6 @@ indeed, packet_set_ring checks that the following condition is true
|
|||||||
|
|
||||||
frames_per_block * tp_block_nr == tp_frame_nr
|
frames_per_block * tp_block_nr == tp_frame_nr
|
||||||
|
|
||||||
|
|
||||||
Lets see an example, with the following values:
|
Lets see an example, with the following values:
|
||||||
|
|
||||||
tp_block_size= 4096
|
tp_block_size= 4096
|
||||||
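The condition that the text says packet_set_ring checks, frames_per_block * tp_block_nr == tp_frame_nr, can be sketched in user space as follows. This is only an illustration: struct ring_geometry and ring_geometry_ok() are hypothetical names, not kernel or libc symbols, and the fields merely mirror those of struct tpacket_req. It assumes frames_per_block is tp_block_size / tp_frame_size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the struct tpacket_req fields used in the text. */
struct ring_geometry {
	unsigned int tp_block_size;	/* bytes per block */
	unsigned int tp_block_nr;	/* number of blocks */
	unsigned int tp_frame_size;	/* bytes per frame */
	unsigned int tp_frame_nr;	/* total number of frames */
};

/*
 * Return 1 if frames_per_block * tp_block_nr == tp_frame_nr, else 0.
 * A frame size of zero, or one larger than a block, is rejected here
 * so that the division below is always meaningful.
 */
static int ring_geometry_ok(const struct ring_geometry *g)
{
	unsigned int frames_per_block;

	assert(g != NULL);
	if (g->tp_frame_size == 0 || g->tp_frame_size > g->tp_block_size)
		return 0;

	frames_per_block = g->tp_block_size / g->tp_frame_size;
	return frames_per_block * g->tp_block_nr == g->tp_frame_nr;
}
```

For instance, with an assumed tp_block_size of 4096, tp_frame_size of 2048 and tp_block_nr of 4, each block holds two frames, so only a tp_frame_nr of 8 passes the check.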
@@ -240,7 +236,6 @@ be spawned across two blocks, so there are some details you have to take into
|
|||||||
account when choosing the frame_size. See "Mapping and use of the circular
|
account when choosing the frame_size. See "Mapping and use of the circular
|
||||||
buffer (ring)".
|
buffer (ring)".
|
||||||
|
|
||||||
|
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
+ PACKET_MMAP setting constraints
|
+ PACKET_MMAP setting constraints
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
@@ -277,7 +272,6 @@ User space programs can include /usr/include/sys/user.h and
|
|||||||
The pagesize can also be determined dynamically with the getpagesize (2)
|
The pagesize can also be determined dynamically with the getpagesize (2)
|
||||||
system call.
|
system call.
|
||||||
|
|
||||||
|
|
||||||
Block number limit
|
Block number limit
|
||||||
--------------------
|
--------------------
|
||||||
|
|
||||||
@@ -297,7 +291,6 @@ called pg_vec, its size limits the number of blocks that can be allocated.
|
|||||||
v block #2
|
v block #2
|
||||||
block #1
|
block #1
|
||||||
|
|
||||||
|
|
||||||
kmalloc allocates any number of bytes of physically contiguous memory from
|
kmalloc allocates any number of bytes of physically contiguous memory from
|
||||||
a pool of pre-determined sizes. This pool of memory is maintained by the slab
|
a pool of pre-determined sizes. This pool of memory is maintained by the slab
|
||||||
allocator which is at the end the responsible for doing the allocation and
|
allocator which is at the end the responsible for doing the allocation and
|
||||||
@@ -312,7 +305,6 @@ pointers to blocks is
|
|||||||
|
|
||||||
131072/4 = 32768 blocks
|
131072/4 = 32768 blocks
|
||||||
|
|
||||||
|
|
||||||
PACKET_MMAP buffer size calculator
|
PACKET_MMAP buffer size calculator
|
||||||
------------------------------------
|
------------------------------------
|
||||||
|
|
||||||
@@ -353,7 +345,6 @@ and a value for <frame size> of 2048 bytes. These parameters will yield
|
|||||||
and hence the buffer will have a 262144 MiB size. So it can hold
|
and hence the buffer will have a 262144 MiB size. So it can hold
|
||||||
262144 MiB / 2048 bytes = 134217728 frames
|
262144 MiB / 2048 bytes = 134217728 frames
|
||||||
|
|
||||||
|
|
||||||
Actually, this buffer size is not possible with an i386 architecture.
|
Actually, this buffer size is not possible with an i386 architecture.
|
||||||
Remember that the memory is allocated in kernel space, in the case of
|
Remember that the memory is allocated in kernel space, in the case of
|
||||||
an i386 kernel's memory size is limited to 1GiB.
|
an i386 kernel's memory size is limited to 1GiB.
|
||||||
@@ -385,7 +376,6 @@ the following (from include/linux/if_packet.h):
|
|||||||
- Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
|
- Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
|
||||||
- Pad to align to TPACKET_ALIGNMENT=16
|
- Pad to align to TPACKET_ALIGNMENT=16
|
||||||
*/
|
*/
|
||||||
|
|
||||||
|
|
||||||
The following are conditions that are checked in packet_set_ring
|
The following are conditions that are checked in packet_set_ring
|
||||||
|
|
||||||
@@ -426,7 +416,6 @@ and the following flags apply:
|
|||||||
#define TP_STATUS_LOSING 4
|
#define TP_STATUS_LOSING 4
|
||||||
#define TP_STATUS_CSUMNOTREADY 8
|
#define TP_STATUS_CSUMNOTREADY 8
|
||||||
|
|
||||||
|
|
||||||
TP_STATUS_COPY : This flag indicates that the frame (and associated
|
TP_STATUS_COPY : This flag indicates that the frame (and associated
|
||||||
meta information) has been truncated because it's
|
meta information) has been truncated because it's
|
||||||
larger than tp_frame_size. This packet can be
|
larger than tp_frame_size. This packet can be
|
||||||
@@ -475,7 +464,6 @@ packets are in the ring:
|
|||||||
It doesn't incur in a race condition to first check the status value and
|
It doesn't incur in a race condition to first check the status value and
|
||||||
then poll for frames.
|
then poll for frames.
|
||||||
|
|
||||||
|
|
||||||
++ Transmission process
|
++ Transmission process
|
||||||
Those defines are also used for transmission:
|
Those defines are also used for transmission:
|
||||||
|
|
||||||
@@ -506,6 +494,196 @@ The user can also use poll() to check if a buffer is available:
|
|||||||
pfd.events = POLLOUT;
|
pfd.events = POLLOUT;
|
||||||
retval = poll(&pfd, 1, timeout);
|
retval = poll(&pfd, 1, timeout);
|
||||||
|
|
||||||
|
-------------------------------------------------------------------------------
|
||||||
|
+ What TPACKET versions are available and when to use them?
|
||||||
|
-------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
int val = tpacket_version;
|
||||||
|
setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
|
||||||
|
getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
|
||||||
|
|
||||||
|
where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
|
||||||
|
|
||||||
|
TPACKET_V1:
|
||||||
|
- Default if not otherwise specified by setsockopt(2)
|
||||||
|
- RX_RING, TX_RING available
|
||||||
|
- VLAN metadata information available for packets
|
||||||
|
(TP_STATUS_VLAN_VALID)
|
||||||
|
|
||||||
|
TPACKET_V1 --> TPACKET_V2:
|
||||||
|
- Made 64 bit clean due to unsigned long usage in TPACKET_V1
|
||||||
|
structures, thus this also works on 64 bit kernel with 32 bit
|
||||||
|
userspace and the like
|
||||||
|
- Timestamp resolution in nanoseconds instead of microseconds
|
||||||
|
- RX_RING, TX_RING available
|
||||||
|
- How to switch to TPACKET_V2:
|
||||||
|
1. Replace struct tpacket_hdr by struct tpacket2_hdr
|
||||||
|
2. Query header len and save
|
||||||
|
3. Set protocol version to 2, set up ring as usual
|
||||||
|
4. For getting the sockaddr_ll,
|
||||||
|
use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
|
||||||
|
(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
|
||||||
|
|
||||||
|
TPACKET_V2 --> TPACKET_V3:
|
||||||
|
- Flexible buffer implementation:
|
||||||
|
1. Blocks can be configured with non-static frame-size
|
||||||
|
2. Read/poll is at a block-level (as opposed to packet-level)
|
||||||
|
3. Added poll timeout to avoid indefinite user-space wait
|
||||||
|
on idle links
|
||||||
|
4. Added user-configurable knobs:
|
||||||
|
4.1 block::timeout
|
||||||
|
4.2 tpkt_hdr::sk_rxhash
|
||||||
|
- RX Hash data available in user space
|
||||||
|
- Currently only RX_RING available
|
||||||
|
|
||||||
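The four steps listed for switching to TPACKET_V2 could look roughly like the sketch below. The helper names select_tpacket_v2() and frame_sockaddr() are made up for this illustration; the socket options and TPACKET_ALIGN() come from <linux/if_packet.h>. Note that getsockopt(2) takes a pointer to a socklen_t as its last argument. Ring setup itself (step 3, "set up ring as usual") is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: query the TPACKET_V2 header length (step 2) and
 * switch the socket to version 2 (step 3). Returns the header length,
 * or -1 if either socket option call fails.
 */
static int select_tpacket_v2(int fd)
{
	int val = TPACKET_V2;
	int hdrlen;
	socklen_t len = sizeof(val);

	/* PACKET_HDRLEN: version goes in, header length comes back. */
	if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
		return -1;
	hdrlen = val;

	val = TPACKET_V2;
	if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)) < 0)
		return -1;

	return hdrlen;
}

/*
 * Hypothetical helper for step 4: locate the sockaddr_ll that follows
 * the frame header, using the queried length instead of sizeof().
 */
static struct sockaddr_ll *frame_sockaddr(void *hdr, int hdrlen)
{
	assert(hdr != NULL);
	return (struct sockaddr_ll *)((char *)hdr + TPACKET_ALIGN(hdrlen));
}
```

Saving the queried header length once and passing it to frame_sockaddr() keeps the offset calculation independent of which header structure the program was compiled against.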
|
-------------------------------------------------------------------------------
|
||||||
|
+ AF_PACKET fanout mode
|
||||||
|
-------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
In the AF_PACKET fanout mode, packet reception can be load balanced among
|
||||||
|
processes. This also works in combination with mmap(2) on packet sockets.
|
||||||
|
|
||||||
|
Minimal example code by David S. Miller (try things like "./test eth0 hash",
|
||||||
|
"./test eth0 lb", etc.):
|
||||||
|
|
||||||
|
#include <stddef.h>
|
||||||
|
#include <stdlib.h>
|
||||||
|
#include <stdio.h>
|
||||||
|
#include <string.h>
|
||||||
|
|
||||||
|
#include <sys/types.h>
|
||||||
|
#include <sys/wait.h>
|
||||||
|
#include <sys/socket.h>
|
||||||
|
#include <sys/ioctl.h>
|
||||||
|
|
||||||
|
#include <unistd.h>
|
||||||
|
|
||||||
|
#include <linux/if_ether.h>
|
||||||
|
#include <linux/if_packet.h>
|
||||||
|
|
||||||
|
#include <net/if.h>
|
||||||
|
|
||||||
|
static const char *device_name;
|
||||||
|
static int fanout_type;
|
||||||
|
static int fanout_id;
|
||||||
|
|
||||||
|
#ifndef PACKET_FANOUT
|
||||||
|
# define PACKET_FANOUT 18
|
||||||
|
# define PACKET_FANOUT_HASH 0
|
||||||
|
# define PACKET_FANOUT_LB 1
|
||||||
|
#endif
|
||||||
|
|
||||||
|
static int setup_socket(void)
|
||||||
|
{
|
||||||
|
int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
|
||||||
|
struct sockaddr_ll ll;
|
||||||
|
struct ifreq ifr;
|
||||||
|
int fanout_arg;
|
||||||
|
|
||||||
|
if (fd < 0) {
|
||||||
|
perror("socket");
|
||||||
|
return EXIT_FAILURE;
|
||||||
|
}
|
||||||
|
|
||||||
|
memset(&ifr, 0, sizeof(ifr));
|
||||||
|
strcpy(ifr.ifr_name, device_name);
|
||||||
|
err = ioctl(fd, SIOCGIFINDEX, &ifr);
|
||||||
|
if (err < 0) {
|
||||||
|
perror("SIOCGIFINDEX");
|
||||||
|
return EXIT_FAILURE;
|
||||||
|
}
|
||||||
|
|
||||||
|
memset(&ll, 0, sizeof(ll));
|
||||||
|
ll.sll_family = AF_PACKET;
|
||||||
|
ll.sll_ifindex = ifr.ifr_ifindex;
|
||||||
|
err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
|
||||||
|
if (err < 0) {
|
||||||
|
perror("bind");
|
||||||
|
return EXIT_FAILURE;
|
||||||
|
}
|
||||||
|
|
||||||
|
fanout_arg = (fanout_id | (fanout_type << 16));
|
||||||
|
err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
|
||||||
|
&fanout_arg, sizeof(fanout_arg));
|
||||||
|
if (err) {
|
||||||
|
perror("setsockopt");
|
||||||
|
return EXIT_FAILURE;
|
||||||
|
}
|
||||||
|
|
||||||
|
return fd;
|
||||||
|
}
|
||||||
|
|
||||||
|
static void fanout_thread(void)
|
||||||
|
{
|
||||||
|
int fd = setup_socket();
|
||||||
|
int limit = 10000;
|
||||||
|
|
||||||
|
if (fd < 0)
|
||||||
|
exit(fd);
|
||||||
|
|
||||||
|
while (limit-- > 0) {
|
||||||
|
char buf[1600];
|
||||||
|
int err;
|
||||||
|
|
||||||
|
err = read(fd, buf, sizeof(buf));
|
||||||
|
if (err < 0) {
|
||||||
|
perror("read");
|
||||||
|
exit(EXIT_FAILURE);
|
||||||
|
}
|
||||||
|
if ((limit % 10) == 0)
|
||||||
|
fprintf(stdout, "(%d) \n", getpid());
|
||||||
|
}
|
||||||
|
|
||||||
|
fprintf(stdout, "%d: Received 10000 packets\n", getpid());
|
||||||
|
|
||||||
|
close(fd);
|
||||||
|
exit(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
int main(int argc, char **argp)
|
||||||
|
{
|
||||||
|
int fd, err;
|
||||||
|
int i;
|
||||||
|
|
||||||
|
if (argc != 3) {
|
||||||
|
fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
|
||||||
|
return EXIT_FAILURE;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!strcmp(argp[2], "hash"))
|
||||||
|
fanout_type = PACKET_FANOUT_HASH;
|
||||||
|
else if (!strcmp(argp[2], "lb"))
|
||||||
|
fanout_type = PACKET_FANOUT_LB;
|
||||||
|
else {
|
||||||
|
fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
|
||||||
|
exit(EXIT_FAILURE);
|
||||||
|
}
|
||||||
|
|
||||||
|
device_name = argp[1];
|
||||||
|
fanout_id = getpid() & 0xffff;
|
||||||
|
|
||||||
|
for (i = 0; i < 4; i++) {
|
||||||
|
pid_t pid = fork();
|
||||||
|
|
||||||
|
switch (pid) {
|
||||||
|
case 0:
|
||||||
|
fanout_thread();
|
||||||
|
|
||||||
|
case -1:
|
||||||
|
perror("fork");
|
||||||
|
exit(EXIT_FAILURE);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
for (i = 0; i < 4; i++) {
|
||||||
|
int status;
|
||||||
|
|
||||||
|
wait(&status);
|
||||||
|
}
|
||||||
|
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
|
||||||
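In setup_socket() above, the PACKET_FANOUT option value is built as fanout_id | (fanout_type << 16): the group id sits in the low 16 bits and the fanout type in the bits above it. A small sketch of that packing, with helper names (fanout_arg_encode(), fanout_arg_group(), fanout_arg_type()) that are invented for this illustration and are not part of any kernel header:

```c
#include <assert.h>
#include <stdint.h>

/* Pack a fanout group id and type the same way setup_socket() does. */
static int fanout_arg_encode(int group_id, int type)
{
	/* Both halves are assumed to fit in 16 bits each. */
	assert(group_id >= 0 && group_id <= 0xffff);
	assert(type >= 0 && type <= 0xffff);

	return (int)((uint32_t)group_id | ((uint32_t)type << 16));
}

/* Extract the group id (low 16 bits) from a packed argument. */
static int fanout_arg_group(int arg)
{
	return (int)((uint32_t)arg & 0xffff);
}

/* Extract the fanout type (bits 16 and up) from a packed argument. */
static int fanout_arg_type(int arg)
{
	return (int)((uint32_t)arg >> 16);
}
```

This also shows why main() masks the pid with 0xffff before using it as fanout_id: anything above the low 16 bits would spill into the type field.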
-------------------------------------------------------------------------------
|
-------------------------------------------------------------------------------
|
||||||
+ PACKET_TIMESTAMP
|
+ PACKET_TIMESTAMP
|
||||||
-------------------------------------------------------------------------------
|
-------------------------------------------------------------------------------
|
||||||
@@ -532,6 +710,13 @@ the networking stack is used (the behavior before this setting was added).
|
|||||||
See include/linux/net_tstamp.h and Documentation/networking/timestamping
|
See include/linux/net_tstamp.h and Documentation/networking/timestamping
|
||||||
for more information on hardware timestamps.
|
for more information on hardware timestamps.
|
||||||
|
|
||||||
|
-------------------------------------------------------------------------------
|
||||||
|
+ Miscellaneous bits
|
||||||
|
-------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
- Packet sockets work well together with Linux socket filters, thus you also
|
||||||
|
might want to have a look at Documentation/networking/filter.txt
|
||||||
|
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
+ THANKS
|
+ THANKS
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|