Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (867 commits)
  [SKY2]: status polling loop (post merge)
  [NET]: Fix NAPI completion handling in some drivers.
  [TCP]: Limit processing lost_retrans loop to work-to-do cases
  [TCP]: Fix lost_retrans loop vs fastpath problems
  [TCP]: No need to re-count fackets_out/sacked_out at RTO
  [TCP]: Extract tcp_match_queue_to_sack from sacktag code
  [TCP]: Kill almost unused variable pcount from sacktag
  [TCP]: Fix mark_head_lost to ignore R-bit when trying to mark L
  [TCP]: Add bytes_acked (ABC) clearing to FRTO too
  [IPv6]: Update setsockopt(IPV6_MULTICAST_IF) to support RFC 3493, try2
  [NETFILTER]: x_tables: add missing ip6t_modulename aliases
  [NETFILTER]: nf_conntrack_tcp: fix connection reopening
  [QETH]: fix qeth_main.c
  [NETLINK]: fib_frontend build fixes
  [IPv6]: Export userland ND options through netlink (RDNSS support)
  [9P]: build fix with !CONFIG_SYSCTL
  [NET]: Fix dev_put() and dev_hold() comments
  [NET]: make netlink user -> kernel interface synchronious
  [NET]: unify netlink kernel socket recognition
  [NET]: cleanup 3rd argument in netlink_sendskb
  ...

Fix up conflicts manually in Documentation/feature-removal-schedule.txt
and my new least favourite crap, the "mod_devicetable" support in the files
include/linux/mod_devicetable.h and scripts/mod/file2alias.c.

(The latter files seem to be explicitly _designed_ to get conflicts when
different subsystems work with them - that have an absolutely horrid lack
of subsystem separation!)

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
@@ -240,17 +240,23 @@ X!Ilib/string.c
<sect1><title>Driver Support</title>
!Enet/core/dev.c
!Enet/ethernet/eth.c
!Enet/sched/sch_generic.c
!Iinclude/linux/etherdevice.h
!Iinclude/linux/netdevice.h
</sect1>
<sect1><title>PHY Support</title>
!Edrivers/net/phy/phy.c
!Idrivers/net/phy/phy.c
!Edrivers/net/phy/phy_device.c
!Idrivers/net/phy/phy_device.c
!Edrivers/net/phy/mdio_bus.c
!Idrivers/net/phy/mdio_bus.c
<!-- FIXME: Removed for now since no structured comments in source
X!Enet/core/wireless.c
-->
</sect1>
<!-- FIXME: Removed for now since no structured comments in source
<sect1><title>Wireless</title>
X!Enet/core/wireless.c
</sect1>
-->
<sect1><title>Synchronous PPP</title>
!Edrivers/net/wan/syncppp.c
</sect1>
@@ -314,3 +314,16 @@ Why: The i386/x86_64 merge provides a symlink to the old bzImage
location so not yet updated user space tools, e.g. package
scripts, do not break.
Who: Thomas Gleixner <tglx@linutronix.de>

---------------------------

What: shaper network driver
When: January 2008
Files: drivers/net/shaper.c, include/linux/if_shaper.h
Why: This driver has been marked obsolete for many years.
It was only designed to work on lower speed links and has design
flaws that lead to machine crashes. The qdisc infrastructure in
2.4 or later kernels provides richer features and is more robust.
Who: Stephen Hemminger <shemminger@linux-foundation.org>

---------------------------
@@ -1,766 +0,0 @@
HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still work in progress so may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bounded by system capacity,
as can be seen from the following data collected by Robert on Gigabit
Ethernet (e1000):

 Psize    Ipps       Tput     Rxint     Txint     Done    Ndone
 ---------------------------------------------------------------
    60  890000     409362        17     27622        7     6823
   128  758150     464364        21      9301       10     7738
   256  445632     774646        42     15507       21    12906
   512  232666     994445    241292     19147   241192     1062
  1024  119061    1000003    872519     19258   872511        0
  1440   85193    1000003    946576     19505   946569        0


Legend:
"Ipps" stands for input packets per second.
"Tput" == packets out of total 1M that made it out.
"Rxint" == receive interrupts seen
"Txint" == transmit completion interrupts seen
"Done" == The number of times that the poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load the more we could clean up the rxring
"Ndone" == is the converse of "Done". Note again, that the higher
the load the more times we couldn't clean up the rxring.

Observe that:
when the NIC receives 890Kpackets/sec only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from that table). So there is
a possibility that under low enough input, you get one poll call for each
input packet caused by a single interrupt each time. And if the system
can't handle an interrupt per packet ratio of 1, then it will just have to
chug along ....


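The interrupt/packet ratio discussed above can be computed directly from the table. The following is only a small userspace sketch, not kernel code: the struct and function names are made up for illustration, and it assumes that the "Rxint" column counts rx interrupts over the same run in which "Tput" packets made it out.

```c
/* One row of the measurement table (only the columns used here). */
struct napi_sample {
        unsigned int  psize;  /* packet size in bytes */
        unsigned long tput;   /* packets, out of 1M sent, that made it out */
        unsigned long rxint;  /* rx interrupts seen (assumed meaning) */
};

/* Rows copied from the table above. */
static const struct napi_sample napi_samples[] = {
        {   60,  409362,     17 },
        {  128,  464364,     21 },
        {  256,  774646,     42 },
        {  512,  994445, 241292 },
        { 1024, 1000003, 872519 },
        { 1440, 1000003, 946576 },
};

/* Rx interrupts per packet that made it through the system. */
static double rx_ints_per_packet(const struct napi_sample *s)
{
        return s->tput ? (double)s->rxint / (double)s->tput : 0.0;
}
```

For the 60-byte row this gives roughly 0.00004 interrupts per packet, while for the 1440-byte row it is roughly 0.95, which is the "inherent mitigation" effect: the ratio approaches 1 only when the input rate is low enough for the system to keep up.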
0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts or maybe events that send packets up
the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document is to move the receive processing only
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handler to
dev->poll() look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms.
        I) what is known as Clear-on-read (COR).
        when you read the status/event register, it clears everything!
        The natsemi and sunbmac NICs are known to do this.
        In this case your only choice is to move all to dev->poll()

        II) Clear-on-write (COW)
         i) you clear the status by writing a 1 in the bit-location you want.
                These are the majority of the NICs and work the best with NAPI.
                Put only receive events in dev->poll(); leave the rest in
                the old interrupt handler.
         ii) whatever you write in the status register clears everything ;->
                Can't seem to find any supported by Linux which do this. If
                someone knows such a chip email us please.
                Move all to dev->poll()

C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts are being
re-enabled (refer to appendix 2). We only get to know about such a packet
when the next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to more
discussion.

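To make the two acknowledgement styles from prerequisite B concrete, here is a minimal userspace model. It is a sketch under stated assumptions, not code for any real NIC: the event bit values, the struct and the function names are all hypothetical, and a plain variable stands in for the hardware status register.

```c
#include <stdint.h>

/* Hypothetical event bits of a status register. */
#define EV_RX   0x01u
#define EV_TX   0x02u
#define EV_LINK 0x04u

/* A plain variable standing in for the hardware status register. */
struct status_reg {
        uint32_t bits;
};

/* Clear-on-read (COR): one read returns the pending events and
 * clears all of them, so rx events cannot be left for later. */
static uint32_t cor_read(struct status_reg *r)
{
        uint32_t pending = r->bits;

        r->bits = 0;
        return pending;
}

/* Clear-on-write, variant i): reading has no side effect ... */
static uint32_t cow_read(const struct status_reg *r)
{
        return r->bits;
}

/* ... and only the bits written as 1 are cleared. */
static void cow_ack(struct status_reg *r, uint32_t mask)
{
        r->bits &= ~mask;
}
```

With the COW model an interrupt handler can call cow_ack(&reg, EV_TX | EV_LINK) and leave EV_RX pending for dev->poll(). With the COR model the first cor_read() consumes every event, which is why the text recommends moving all processing to dev->poll() for such hardware.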
Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
only one CPU can pick the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round robin format.
This implies receive is totally lockless because of the guarantee that only
one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move all to dev->poll()). For
example link/MII and txcomplete continue functioning just the same old way.
This improves the latency of processing these events. It is also assumed that
the receive interrupt is the largest cause of noise. Note this might not
always be true.
[according to Manfred Spraul, the winbond insists on sending one
txmitcomplete interrupt for each packet (although this can be mitigated)].
For these broken drivers, move all to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

new methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
        Called by an IRQ handler to schedule a poll for device

b) netif_rx_schedule_prep(dev)
        puts the device in a state which allows for it to be added to the
        CPU polling list if it is up and running. You can look at this as
        the first half of netif_rx_schedule(dev) above; the second half
        being c) below.

c) __netif_rx_schedule(dev)
        Add device to the poll list for this CPU; assuming that _prep above
        has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
        Called to reschedule polling for device specifically for some
        deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)

        Remove interface from the CPU poll list: it must be in the poll list
        on current cpu. This primitive is called by dev->poll(), when
        it completes its work. The device cannot be out of poll list at this
        call, if it is then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity.

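The relationship between methods a), b), c) and e) can be pictured with a toy state machine. This is a userspace sketch only: the toy_* names and the two flags are invented for illustration and do not reflect how the kernel actually stores this state.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the scheduling state of one device. */
struct toy_dev {
        bool running;       /* device is up and running */
        bool sched;         /* device has been prepared for polling */
        bool on_poll_list;  /* device sits on this CPU's poll list */
};

/* Models b): succeeds only if the device is up and not yet scheduled. */
static bool toy_schedule_prep(struct toy_dev *d)
{
        if (!d->running || d->sched)
                return false;
        d->sched = true;
        return true;
}

/* Models c): the caller must have seen toy_schedule_prep() succeed. */
static void toy_do_schedule(struct toy_dev *d)
{
        d->on_poll_list = true;
}

/* Models a): the first half followed by the second half. */
static void toy_schedule(struct toy_dev *d)
{
        if (toy_schedule_prep(d))
                toy_do_schedule(d);
}

/* Models e): called at the end of a poll that drained the ring. */
static void toy_complete(struct toy_dev *d)
{
        d->on_poll_list = false;
        d->sched = false;
}
```

In this model a second toy_schedule_prep() call before toy_complete() returns false, which corresponds to the "interrupt while in poll" case handled in the interrupt handler later in this document.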
Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets by the current CPU before yielding to the network
subsystem (so other devices can also get opportunity to send to the stack).

dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
        Total number of packets cannot exceed dev->quota.

dev->poll() method is invoked by the top layer; the driver just sends, if it
can, the requested quantity of packets to the stack.

more on dev->poll() below after the interrupt changes are explained.

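The quota and budget bookkeeping described above can be sketched in plain C. This is a userspace model, not a driver: struct toy_netdev and toy_poll() are hypothetical, the rx ring is reduced to a packet counter, and capping the work at the smaller of dev->quota and *budget is an assumption made here for safety rather than something the text above spells out.

```c
/* Hypothetical stand-in for the fields of struct net_device used here. */
struct toy_netdev {
        int quota;       /* packets this device may still send this round */
        int rx_pending;  /* packets waiting in the (simulated) rx ring */
};

/* Returns 0 when the ring was drained ("done") and 1 when packets
 * remain but the allowance was used up ("not done"). */
static int toy_poll(struct toy_netdev *dev, int *budget)
{
        int limit = dev->quota < *budget ? dev->quota : *budget;
        int received = 0;

        while (dev->rx_pending > 0 && received < limit) {
                dev->rx_pending--;   /* "send one packet up the stack" */
                received++;
        }

        /* The driver, not the core, does the accounting. */
        dev->quota -= received;
        *budget -= received;

        return dev->rx_pending > 0;
}
```

Starting from quota 16, 40 pending packets and a budget of 300, one call delivers 16 packets, leaves quota at 0 and budget at 284, and returns 1 so that the core polls the device again later.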
2) registering dev->poll() method
===================================

dev->poll should be set in the dev->probe() method.
e.g:
dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;



3) scheduling dev->poll()
=============================
This involves modifying the interrupt handler and the code
path which takes the packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D Becker
interrupt processor:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{

        struct net_device *dev = (struct net_device *)dev_id;
        struct my_private *tp = (struct my_private *)dev->priv;

        int work_count = my_work_count;
        status = read_interrupt_status_reg();
        if (status == 0)
                return IRQ_NONE;        /* Shared IRQ: not us */
        if (status == 0xffff)
                return IRQ_HANDLED;     /* Hot unplug */
        if (status & error)
                do_some_error_handling();

        do {
                acknowledge_ints_ASAP();

                if (status & link_interrupt) {
                        spin_lock(&tp->link_lock);
                        do_some_link_stat_stuff();
                        spin_unlock(&tp->link_lock);
                }

                if (status & rx_interrupt) {
                        receive_packets(dev);
                }

                if (status & rx_nobufs) {
                        make_rx_buffs_avail();
                }

                if (status & tx_related) {
                        spin_lock(&tp->lock);
                        tx_ring_free(dev);
                        if (tx_died)
                                restart_tx();
                        spin_unlock(&tp->lock);
                }

                status = read_interrupt_status_reg();

        } while (!(status & error) || more_work_to_be_done);
        return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = (struct net_device *)dev_id;
        struct my_private *tp = (struct my_private *)dev->priv;

        status = read_interrupt_status_reg();
        if (status == 0)
                return IRQ_NONE;        /* Shared IRQ: not us */
        if (status == 0xffff)
                return IRQ_HANDLED;     /* Hot unplug */
        if (status & error)
                do_some_error_handling();

        do {
/************************ start note *********************************/
                acknowledge_ints_ASAP();  // don't ack rx and rxnobuff here
/************************ end note *********************************/

                if (status & link_interrupt) {
                        spin_lock(&tp->link_lock);
                        do_some_link_stat_stuff();
                        spin_unlock(&tp->link_lock);
                }
/************************ start note *********************************/
                if (status & rx_interrupt || (status & rx_nobufs)) {
                        if (netif_rx_schedule_prep(dev)) {

                                /* disable interrupts caused
                                 * by arriving packets */
                                disable_rx_and_rxnobuff_ints();
                                /* tell system we have work to be done. */
                                __netif_rx_schedule(dev);
                        } else {
                                printk("driver bug! interrupt while in poll\n");
                                /* FIX by disabling interrupts */
                                disable_rx_and_rxnobuff_ints();
                        }
                }
/************************ end note *********************************/

                if (status & tx_related) {
                        spin_lock(&tp->lock);
                        tx_ring_free(dev);

                        if (tx_died)
                                restart_tx();
                        spin_unlock(&tp->lock);
                }

                status = read_interrupt_status_reg();

/************************ start note *********************************/
        } while (!(status & error) || more_work_to_be_done(status));
/************************ end note *********************************/
        return IRQ_HANDLED;
}

---------------------------------------------------------------------


We note several things from above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) b) a packet arriving and finding no DMA buffers available
(rxnobuff).
This also means acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI; at the poll() and refill_rx_ring()
discussed further below.
netif_rx_schedule_prep() returns 1 if device is in running state and
gets successfully added to the core poll list. If we get a zero value
we can _almost_ assume we are already added to the list (instead of not
running; the logic is based on the fact that you shouldn't get an interrupt
if not running).
We rectify this by disabling rx and rxnobuf interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() may appear to have
disappeared. These functionalities are still around actually......

In fact, receive_packets(dev) is very close to my_poll() and
make_rx_buffs_avail() is invoked from my_poll()

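The rx branch of the NAPI-enabled interrupt handler boils down to a small decision, sketched here as a userspace model. Everything in it is hypothetical (the toy_* names, the event mask, the counters); the `polling` flag merely stands in for what netif_rx_schedule_prep() would report.

```c
#include <stdbool.h>

/* Hypothetical mask covering the rx and rxnobuff status bits. */
#define TOY_RX_EVENTS 0x3u

struct toy_irq_dev {
        bool polling;          /* already scheduled for dev->poll() */
        bool rx_ints_enabled;  /* rx and rxnobuff interrupt sources */
        int  polls_scheduled;  /* how often a poll was scheduled */
        int  bug_reports;      /* "interrupt while in poll" events */
};

/* Rx part of the interrupt handler: schedule a poll once, and in
 * every case turn off the interrupt sources caused by arriving packets. */
static void toy_rx_irq(struct toy_irq_dev *d, unsigned int status)
{
        if (!(status & TOY_RX_EVENTS))
                return;

        if (!d->polling) {
                /* stands in for netif_rx_schedule_prep() returning 1 */
                d->polling = true;
                d->rx_ints_enabled = false;
                d->polls_scheduled++;   /* __netif_rx_schedule() */
        } else {
                /* should not happen: rx interrupts were supposed to be off */
                d->bug_reports++;
                d->rx_ints_enabled = false;
        }
}
```

Calling toy_rx_irq() twice with an rx event schedules exactly one poll; the second call only records the unexpected interrupt and masks the rx sources again, mirroring the else branch of the handler above.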
4) converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D Becker receive_packets(dev) to my_poll()

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by interrupt handler */
static void receive_packets (struct net_device *dev)
{

        struct my_private *tp = (struct my_private *)dev->priv;
        rx_ring = tp->rx_ring;
        cur_rx = tp->cur_rx;
        int entry = cur_rx % RX_RING_SIZE;
        int received = 0;
        int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

        while (rx_ring_not_empty) {
                u32 rx_status;
                unsigned int rx_size;
                unsigned int pkt_size;
                struct sk_buff *skb;
                /* read size+status of next frame from DMA ring buffer */
                /* the number 16 and 4 are just examples */
                rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                rx_size = rx_status >> 16;
                pkt_size = rx_size - 4;

                /* process errors */
                if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                    (!(rx_status & RxStatusOK))) {
                        netdrv_rx_err (rx_status, dev, tp, ioaddr);
                        return;
                }

                if (--rx_work_limit < 0)
                        break;

                /* grab a skb */
                skb = dev_alloc_skb (pkt_size + 2);
                if (skb) {
                        .
                        .
                        netif_rx (skb);
                        .
                        .
                } else {  /* OOM */
                        /* seems very driver specific ... some just pass
                           whatever is on the ring already. */
                }

                /* move to the next skb on the ring */
                entry = (++cur_rx) % RX_RING_SIZE;
                received++ ;

        }

        /* store current ring pointer state */
        tp->cur_rx = cur_rx;

        /* Refill the Rx ring buffers if they are needed */
        refill_rx_ring();
        .
        .

}
-------------------------------------------------------------------
We change it to a new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{

        struct my_private *tp = (struct my_private *)dev->priv;
        rx_ring = tp->rx_ring;
        cur_rx = tp->cur_rx;
        int entry = cur_rx % RX_RING_SIZE;
        int received = 0;
        /* maximum packets to send to the stack */
/************************ note note *********************************/
        int rx_work_limit = dev->quota;

/************************ end note note *********************************/
        do {  // outer beginning loop starts here

                clear_rx_status_register_bit();

                while (rx_ring_not_empty) {
                        u32 rx_status;
                        unsigned int rx_size;
                        unsigned int pkt_size;
                        struct sk_buff *skb;
                        /* read size+status of next frame from DMA ring buffer */
                        /* the number 16 and 4 are just examples */
                        rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                        rx_size = rx_status >> 16;
                        pkt_size = rx_size - 4;

                        /* process errors */
                        if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                            (!(rx_status & RxStatusOK))) {
                                netdrv_rx_err (rx_status, dev, tp, ioaddr);
                                return 1;
                        }

/************************ note note *********************************/
                        if (--rx_work_limit < 0) { /* we got packets, but no quota */
                                /* store current ring pointer state */
                                tp->cur_rx = cur_rx;

                                /* Refill the Rx ring buffers if they are needed */
                                refill_rx_ring(dev);
                                goto not_done;
                        }
/**********************  end note **********************************/

                        /* grab a skb */
                        skb = dev_alloc_skb (pkt_size + 2);
                        if (skb) {
                                .
                                .
/************************ note note *********************************/
                                netif_receive_skb (skb);
/**********************  end note **********************************/
                                .
                                .
                        } else {  /* OOM */
                                /* seems very driver specific ... common is just pass
                                   whatever is on the ring already. */
                        }

                        /* move to the next skb on the ring */
                        entry = (++cur_rx) % RX_RING_SIZE;
                        received++ ;

                }

                /* store current ring pointer state */
                tp->cur_rx = cur_rx;

                /* Refill the Rx ring buffers if they are needed */
                refill_rx_ring(dev);

                /* no packets on ring; but new ones can arrive since we last
                   checked */
                status = read_interrupt_status_reg();
                if (rx status is not set) {
                        /* If something arrives in this narrow window,
                           an interrupt will be generated */
                        goto done;
                }
                /* done! at least that's what it looks like ;->
                   if new packets came in after our last check on status bits
                   they'll be caught by the while check and we go back and clear them
                   since we haven't exceeded our quota */
        } while (rx_status_is_set);

done:

/************************ note note *********************************/
        dev->quota -= received;
        *budget -= received;

        /* If RX ring is not full we are out of memory. */
        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                goto oom;

        /* we are happy/done, no more packets on ring; put us back
           to where we can start processing interrupts again */
        netif_rx_complete(dev);
        enable_rx_and_rxnobuf_ints();

        /* The last op happens after poll completion. Which means the following:
         * 1. it can race with disabling irqs in irq handler (which are done to
         * schedule polls)
         * 2. it can race with dis/enabling irqs in other poll threads
         * 3. if an irq raised after the beginning of the outer beginning
         * loop (marked in the code above), it will be immediately
         * triggered here.
         *
         * Summarizing: the logic may result in some redundant irqs both
         * due to races in masking and due to too late acking of already
         * processed irqs. The good news: no events are ever lost.
         */

        return 0;   /* done */

not_done:
        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        if (!received) {
                printk("received==0\n");
                received = 1;
        }
        dev->quota -= received;
        *budget -= received;
        return 1;  /* not_done */

oom:
        /* Start timer, stop polling, but do not enable rx interrupts. */
        start_poll_timer(dev);
        return 0;  /* we'll take it from here so tell core "done" */

/************************ End note note *********************************/
}
-------------------------------------------------------------------

From above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work.
2) We have a done and not_done state.
3) instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) we have a new way of handling oom condition
5) A new outer do { } while loop has been added. This serves the purpose of
ensuring that if a new packet has come in, after we are all set and done,
and we have not exceeded our quota that we continue sending packets up.


-----------------------------------------------------------
Poll timer code will need to do the following:

a)

        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        /* If RX ring is not full we are still out of memory.
           Restart the timer again. Else we re-add ourselves
           to the master poll list.
        */

        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                restart_timer();

        else netif_rx_schedule(dev);  /* we are back on the poll list */

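The poll timer logic above can be exercised in userspace with a toy ring. This is a sketch under assumptions: the ring size, struct toy_ring, the free_skbs counter (standing in for whether buffer allocation succeeds) and the returned enum are all invented here; in a driver the two outcomes would be restart_timer() and netif_rx_schedule(dev).

```c
#include <stdbool.h>

#define TOY_RING_SIZE 8

enum toy_timer_action { TOY_RESTART_TIMER, TOY_RESCHEDULE_POLL };

struct toy_ring {
        unsigned int cur_rx;    /* next slot the driver will receive from */
        unsigned int dirty_rx;  /* oldest slot still waiting for a buffer */
        bool has_skb[TOY_RING_SIZE];
        int free_skbs;          /* buffers the allocator can still hand out */
};

/* Give buffers back to consumed slots for as long as memory lasts. */
static void toy_refill_rx_ring(struct toy_ring *r)
{
        while (r->dirty_rx != r->cur_rx && r->free_skbs > 0) {
                r->has_skb[r->dirty_rx % TOY_RING_SIZE] = true;
                r->dirty_rx++;
                r->free_skbs--;
        }
}

/* Timer body: try to refill, then either keep waiting or resume polling. */
static enum toy_timer_action toy_poll_timer(struct toy_ring *r)
{
        if (r->cur_rx - r->dirty_rx > TOY_RING_SIZE / 2 ||
            !r->has_skb[r->dirty_rx % TOY_RING_SIZE])
                toy_refill_rx_ring(r);

        /* Still a slot without a buffer: we are still out of memory. */
        if (!r->has_skb[r->dirty_rx % TOY_RING_SIZE])
                return TOY_RESTART_TIMER;

        return TOY_RESCHEDULE_POLL;
}
```

As long as the allocator cannot fill every consumed slot, the timer keeps restarting itself; polling is resumed only once the oldest dirty slot has a buffer again.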
5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about this; the top net layer takes
care of it.

6) Adding new Stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC (flow control) only send a pause packet when they run
out of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packetstorm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note FC in itself is a good solution but we have found it to not be
much of a commodity feature (both in NICs and switches) and hence falls
under the same category as using NIC based mitigation. Also, experiments
indicate that it's much harder to resolve the resource allocation
issue (aka lazy receiving that NAPI offers) and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but
is not necessary.


APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here

1) status/int which honors level triggered IRQ

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated. [assuming the status bit was not turned off].
Generally the concept of level triggered IRQs in association with a status and
interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in tulip).
the corresponding interrupt bit (CSR7 in tulip) might be turned off (but
the CSR5 will continue to be turned on with new packet arrivals even if
we clear it the first time)
Very important is the fact that if we turn the interrupt bit on when
status is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done" and then went on to do a few other things; then when we enable
interrupts, there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
do {
        ACK;
        while (ring_is_not_empty()) {
                work-work-work
                if quota is exceeded: exit, no touching irq status/mask
        }
        /* No packets, but new can arrive while we are doing this */
        CSR5 := read
        if (CSR5 is not set) {
                /* If something arrives in this narrow window here,
                 * where the comments are ;-> irq will be generated */
                unmask irqs;
                exit poll;
        }
} while (rx_status_is_set);
------------------------

CSR5 bit of interest is only the rx status.
If you look at the last if statement:
you just finished grabbing all the packets from the rx ring .. you check if
status bit says there are more packets just in ... it says none; you then
enable rx interrupts again; if a new packet just came in during this check,
we are counting that CSR5 will be set in that small window of opportunity
and that by re-enabling interrupts, we would actually trigger an interrupt
to register the new packet for processing.

[The above description may be very verbose, if you have better wording
that will make this more understandable, please suggest it.]

2) non-capable hardware
|
||||
|
||||
These do not generally respect level triggered IRQs. Normally,
|
||||
irqs may be lost while being masked and the only way to leave poll is to do
|
||||
a double check for new input after netif_rx_complete() is invoked
|
||||
and re-enable polling (after seeing this new input).
|
||||
|
||||
Sample code:
|
||||
|
||||
---------
|
||||
.
|
||||
.
|
||||
restart_poll:
|
||||
while (ring_is_not_empty()) {
|
||||
work-work-work
|
||||
if quota is exceeded: exit, not touching irq status/mask
|
||||
}
|
||||
.
|
||||
.
|
||||
.
|
||||
enable_rx_interrupts()
|
||||
netif_rx_complete(dev);
|
||||
if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
|
||||
disable_rx_and_rxnobufs()
|
||||
goto restart_poll
|
||||
}
|
||||
---------
|
||||
|
||||
Basically netif_rx_complete() removes us from the poll list, but because a
|
||||
new packet which will never be caught due to the possibility of a race
|
||||
might come in, we attempt to re-add ourselves to the poll list.
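
A user-space model of this double check might look as follows. It is a sketch, not driver code: `struct model_dev`, `model_reschedule()` and `model_poll_recheck()` are hypothetical names, and the `late_arrivals` field exists only to simulate a packet that lands right after the ring looked empty.

```c
#include <stdbool.h>

/* Hypothetical device model, not a kernel structure. */
struct model_dev {
    int  rx_ring_pending;   /* packets waiting in the RX ring */
    int  late_arrivals;     /* packets that land just after the ring looked empty */
    bool on_poll_list;
    bool rx_irq_enabled;
};

/* Stand-in for netif_rx_reschedule(): succeeds only if the device is
 * not already back on the poll list. */
static bool model_reschedule(struct model_dev *dev)
{
    if (dev->on_poll_list)
        return false;
    dev->on_poll_list = true;
    return true;
}

/* Returns the number of packets processed in this invocation. */
int model_poll_recheck(struct model_dev *dev, int quota)
{
    int done = 0;

restart_poll:
    while (dev->rx_ring_pending > 0) {
        if (done >= quota)
            return done;              /* quota exceeded: irq state untouched */
        dev->rx_ring_pending--;
        done++;
    }

    dev->rx_irq_enabled = true;       /* enable_rx_interrupts() */
    dev->on_poll_list = false;        /* netif_rx_complete() */

    /* Simulate the race: packets arrive after the ring was seen empty. */
    dev->rx_ring_pending += dev->late_arrivals;
    dev->late_arrivals = 0;

    if (dev->rx_ring_pending > 0 && model_reschedule(dev)) {
        dev->rx_irq_enabled = false;  /* disable_rx_and_rxnobufs() */
        goto restart_poll;
    }
    return done;
}
```

The late packet is picked up by the second pass instead of waiting for an interrupt that this class of hardware may never deliver.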
|
||||
|
||||
|
||||
|
||||
|
||||
APPENDIX 3: Scheduling issues.
|
||||
==============================
|
||||
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
|
||||
general solution to schedule softirq's to run before next interrupt and by putting
|
||||
them under scheduler control. Also this prevents consecutive softirq's from
|
||||
monopolizing the CPU. This also has the effect that the priority of ksoftirqd needs
|
||||
to be considered when running very CPU-intensive applications and networking to
|
||||
get the proper softirq/user balance. Increasing the ksoftirqd priority to 0
|
||||
(eventually more) is reported to cure problems with low network performance at high
|
||||
CPU load.
|
||||
|
||||
Most used processes in a GIGE router:
|
||||
USER PID %CPU %MEM SIZE RSS TTY STAT START TIME COMMAND
|
||||
root 3 0.2 0.0 0 0 ? RWN Aug 15 602:00 (ksoftirqd_CPU0)
|
||||
root 232 0.0 7.9 41400 40884 ? S Aug 15 74:12 gated
|
||||
|
||||
--------------------------------------------------------------------
|
||||
|
||||
relevant sites:
|
||||
==================
|
||||
ftp://robur.slu.se/pub/Linux/net-development/NAPI/
|
||||
|
||||
|
||||
--------------------------------------------------------------------
|
||||
TODO: Write net-skeleton.c driver.
|
||||
-------------------------------------------------------------
|
||||
|
||||
Authors:
|
||||
========
|
||||
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
|
||||
Jamal Hadi Salim <hadi@cyberus.ca>
|
||||
Robert Olsson <Robert.Olsson@data.slu.se>
|
||||
|
||||
Acknowledgements:
|
||||
================
|
||||
People who made this document better:
|
||||
|
||||
Lennert Buytenhek <buytenh@gnu.org>
|
||||
Andrew Morton <akpm@zip.com.au>
|
||||
Manfred Spraul <manfred@colorfullife.com>
|
||||
Donald Becker <becker@scyld.com>
|
||||
Jeff Garzik <jgarzik@pobox.com>
|
@@ -38,8 +38,13 @@ Socket options
|
||||
DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of
|
||||
service codes (RFC 4340, sec. 8.1.2); if this socket option is not set,
|
||||
the socket will fall back to 0 (which means that no meaningful service code
|
||||
is present). Connecting sockets set at most one service option; for
|
||||
listening sockets, multiple service codes can be specified.
|
||||
is present). On active sockets this is set before connect(); specifying more
|
||||
than one code has no effect (all subsequent service codes are ignored). The
|
||||
case is different for passive sockets, where multiple service codes (up to 32)
|
||||
can be set before calling bind().
|
||||
|
||||
DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet
|
||||
size (application payload size) in bytes, see RFC 4340, section 14.
|
||||
|
||||
DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the
|
||||
partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums
|
||||
@@ -50,12 +55,13 @@ be enabled at the receiver, too with suitable choice of CsCov.
|
||||
DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the
|
||||
range 0..15 are acceptable. The default setting is 0 (full coverage),
|
||||
values between 1..15 indicate partial coverage.
|
||||
DCCP_SOCKOPT_SEND_CSCOV is for the receiver and has a different meaning: it
|
||||
DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it
|
||||
sets a threshold, where again values 0..15 are acceptable. The default
|
||||
of 0 means that all packets with a partial coverage will be discarded.
|
||||
Values in the range 1..15 indicate that packets with minimally such a
|
||||
coverage value are also acceptable. The higher the number, the more
|
||||
restrictive this setting (see [RFC 4340, sec. 9.2.1]).
|
||||
restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage
|
||||
settings are inherited to the child socket after accept().
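
The two checksum coverage rules can be written down as small helpers. This is a sketch of the rules as described above and in RFC 4340, sec. 9.2, not the kernel's implementation; the function names are hypothetical, and packets whose CsCov claims more data than they carry are not handled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Number of application-data bytes covered by a partial checksum.
 * Valid for CsCov 1..15 (RFC 4340, sec. 9.2): the checksum covers the
 * header, the options and the first (CsCov - 1) * 4 bytes of application
 * data. CsCov 0 means full coverage and must be handled by the caller. */
size_t cscov_partial_bytes(unsigned cscov)
{
    return (size_t)(cscov - 1) * 4;
}

/* Receiver-side decision for a threshold set via DCCP_SOCKOPT_RECV_CSCOV,
 * following the text above: full coverage always passes, the default
 * threshold 0 rejects every partially covered packet, and a threshold of
 * 1..15 accepts packets with at least that coverage value. */
bool cscov_acceptable(unsigned pkt_cscov, unsigned min_cscov)
{
    if (pkt_cscov == 0)
        return true;
    if (min_cscov == 0)
        return false;
    return pkt_cscov >= min_cscov;
}
```

For example, a sender CsCov of 3 protects the header, the options and the first 8 bytes of payload; a receiver with threshold 4 would discard such a packet, while a receiver with threshold 2 would accept it.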
|
||||
|
||||
The following two options apply to CCID 3 exclusively and are getsockopt()-only.
|
||||
In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned.
|
||||
@@ -112,9 +118,14 @@ tx_qlen = 5
|
||||
The size of the transmit buffer in packets. A value of 0 corresponds
|
||||
to an unbounded transmit buffer.
|
||||
|
||||
sync_ratelimit = 125 ms
|
||||
The timeout between subsequent DCCP-Sync packets sent in response to
|
||||
sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit
|
||||
of this parameter is milliseconds; a value of 0 disables rate-limiting.
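
The rate-limiting rule can be sketched as below. This is an illustration only: `sync_allowed()` is a hypothetical helper working on millisecond timestamps, whereas the kernel keeps its own per-socket state in different units.

```c
#include <stdbool.h>

/* Decide whether another DCCP-Sync may be sent at time now_ms.
 * *last_sync_ms holds the time of the previous Sync and is updated when
 * sending is allowed. A ratelimit_ms of 0 disables rate-limiting.
 * Unsigned subtraction keeps the comparison correct across wrap-around. */
bool sync_allowed(unsigned long now_ms, unsigned long *last_sync_ms,
                  unsigned long ratelimit_ms)
{
    if (ratelimit_ms != 0 && now_ms - *last_sync_ms < ratelimit_ms)
        return false;
    *last_sync_ms = now_ms;
    return true;
}
```

With the default of 125 ms, a Sync sent at t = 1000 ms suppresses further Syncs on that socket until t = 1125 ms.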
|
||||
|
||||
Notes
|
||||
=====
|
||||
|
||||
DCCP does not travel through NAT successfully at present on many boxes. This is
|
||||
because the checksum covers the psuedo-header as per TCP and UDP. Linux NAT
|
||||
because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT
|
||||
support for DCCP has been added.
|
||||
|
@@ -1,52 +0,0 @@
|
||||
The Digi International RightSwitch SE-X (dgrs) Device Driver
|
||||
|
||||
This is a Linux driver for the Digi International RightSwitch SE-X
|
||||
EISA and PCI boards. These are 4 (EISA) or 6 (PCI) port Ethernet
|
||||
switches and a NIC combined into a single board. This driver can
|
||||
be compiled into the kernel statically or as a loadable module.
|
||||
|
||||
There is also a companion management tool, called "xrightswitch".
|
||||
The management tool lets you watch the performance graphically,
|
||||
as well as set the SNMP agent IP and IPX addresses, IEEE Spanning
|
||||
Tree, and Aging time. These can also be set from the command line
|
||||
when the driver is loaded. The driver command line options are:
|
||||
|
||||
debug=NNN Debug printing level
|
||||
dma=0/1 Disable/Enable DMA on PCI card
|
||||
spantree=0/1 Disable/Enable IEEE spanning tree
|
||||
hashexpire=NNN Change address aging time (default 300 seconds)
|
||||
ipaddr=A,B,C,D Set SNMP agent IP address i.e. 199,86,8,221
|
||||
iptrap=A,B,C,D Set SNMP agent IP trap address i.e. 199,86,8,221
|
||||
ipxnet=NNN Set SNMP agent IPX network number
|
||||
nicmode=0/1 Disable/Enable multiple NIC mode
|
||||
|
||||
There is also a tool for setting up input and output packet filters
|
||||
on each port, called "dgrsfilt".
|
||||
|
||||
Both the management tool and the filtering tool are available
|
||||
separately from the following FTP site:
|
||||
|
||||
ftp://ftp.dgii.com/drivers/rightswitch/linux/
|
||||
|
||||
When nicmode=1, the board and driver operate as 4 or 6 individual
|
||||
NIC ports (eth0...eth5) instead of as a switch. All switching
|
||||
functions are disabled. In the future, the board firmware may include
|
||||
a routing cache when in this mode.
|
||||
|
||||
Copyright 1995-1996 Digi International Inc.
|
||||
|
||||
This software may be used and distributed according to the terms
|
||||
of the GNU General Public License, incorporated herein by reference.
|
||||
|
||||
For information on purchasing a RightSwitch SE-4 or SE-6
|
||||
board, please contact Digi's sales department at 1-612-912-3444
|
||||
or 1-800-DIGIBRD. Outside the U.S., please check our Web page at:
|
||||
|
||||
http://www.dgii.com
|
||||
|
||||
for sales offices worldwide. Tech support is also available through
|
||||
the channels listed on the Web site, although as long as I am
|
||||
employed on networking products at Digi I will be happy to provide
|
||||
any bug fixes that may be needed.
|
||||
|
||||
-Rick Richardson, rick@dgii.com
|
@@ -180,13 +180,20 @@ tcp_fin_timeout - INTEGER
|
||||
to live longer. Cf. tcp_max_orphans.
|
||||
|
||||
tcp_frto - INTEGER
|
||||
Enables F-RTO, an enhanced recovery algorithm for TCP retransmission
|
||||
Enables Forward RTO-Recovery (F-RTO) defined in RFC4138.
|
||||
F-RTO is an enhanced recovery algorithm for TCP retransmission
|
||||
timeouts. It is particularly beneficial in wireless environments
|
||||
where packet loss is typically due to random radio interference
|
||||
rather than intermediate router congestion. If set to 1, basic
|
||||
version is enabled. 2 enables SACK enhanced F-RTO, which is
|
||||
EXPERIMENTAL. The basic version can be used also when SACK is
|
||||
enabled for a flow through tcp_sack sysctl.
|
||||
rather than intermediate router congestion. F-RTO is a sender-side
|
||||
only modification. Therefore it does not require any support from
|
||||
the peer, but in a typical case, however, where wireless link is
|
||||
the local access link and most of the data flows downlink, the
|
||||
faraway servers should have FRTO enabled to take advantage of it.
|
||||
If set to 1, basic version is enabled. 2 enables SACK enhanced
|
||||
F-RTO if flow uses SACK. The basic version can be used also when
|
||||
SACK is in use, though scenarios exist where F-RTO
|
||||
interacts badly with the packet counting of the SACK enabled TCP
|
||||
flow.
|
||||
|
||||
tcp_frto_response - INTEGER
|
||||
When F-RTO has detected that a TCP retransmission timeout was
|
||||
|
@@ -13,15 +13,35 @@ The radiotap format is discussed in
|
||||
./Documentation/networking/radiotap-headers.txt.
|
||||
|
||||
Although 13 radiotap argument types are currently defined, most only make sense
|
||||
to appear on received packets. Currently three kinds of argument are used by
|
||||
the injection code, although it knows to skip any other arguments that are
|
||||
present (facilitating replay of captured radiotap headers directly):
|
||||
to appear on received packets. The following information is parsed from the
|
||||
radiotap headers and used to control injection:
|
||||
|
||||
- IEEE80211_RADIOTAP_RATE - u8 arg in 500kbps units (0x02 --> 1Mbps)
|
||||
* IEEE80211_RADIOTAP_RATE
|
||||
|
||||
- IEEE80211_RADIOTAP_ANTENNA - u8 arg, 0x00 = ant1, 0x01 = ant2
|
||||
rate in 500kbps units, automatic if invalid or not present
|
||||
|
||||
- IEEE80211_RADIOTAP_DBM_TX_POWER - u8 arg, dBm
|
||||
|
||||
* IEEE80211_RADIOTAP_ANTENNA
|
||||
|
||||
antenna to use, automatic if not present
|
||||
|
||||
|
||||
* IEEE80211_RADIOTAP_DBM_TX_POWER
|
||||
|
||||
transmit power in dBm, automatic if not present
|
||||
|
||||
|
||||
* IEEE80211_RADIOTAP_FLAGS
|
||||
|
||||
IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated
|
||||
IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available
|
||||
IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the
|
||||
current fragmentation threshold. Note that
|
||||
this flag is only reliable when software
|
||||
fragmentation is enabled.
|
||||
|
||||
The injection code can also skip all other currently defined radiotap fields
|
||||
facilitating replay of captured radiotap headers directly.
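
As a sketch of how such a header is laid out, the helper below fills in a minimal radiotap header carrying rate, transmit power and antenna. `build_radiotap()` and the `RT_*` names are hypothetical; the bit numbers (2, 10 and 11) and the little-endian layout follow the radiotap definition, and the three one-byte fields appear in ascending bit order.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit numbers in the radiotap "present" bitmap. */
enum { RT_RATE = 2, RT_DBM_TX_POWER = 10, RT_ANTENNA = 11 };

/* 8-byte fixed header plus three one-byte fields. */
#define RT_HDR_LEN 11

/* Write the header into buf; returns its length, or 0 if buf is too small. */
size_t build_radiotap(uint8_t *buf, size_t cap, uint8_t rate_500kbps,
                      int8_t tx_power_dbm, uint8_t antenna)
{
    uint32_t present = (1u << RT_RATE) | (1u << RT_DBM_TX_POWER) |
                       (1u << RT_ANTENNA);

    if (cap < RT_HDR_LEN)
        return 0;

    buf[0] = 0;                         /* radiotap version */
    buf[1] = 0;                         /* padding */
    buf[2] = RT_HDR_LEN;                /* header length, little endian */
    buf[3] = 0;
    buf[4] = (uint8_t)(present & 0xff); /* present bitmap, little endian */
    buf[5] = (uint8_t)((present >> 8) & 0xff);
    buf[6] = (uint8_t)((present >> 16) & 0xff);
    buf[7] = (uint8_t)((present >> 24) & 0xff);
    buf[8] = rate_500kbps;              /* e.g. 0x02 is 1 Mbps */
    buf[9] = (uint8_t)tx_power_dbm;     /* dBm */
    buf[10] = antenna;
    return RT_HDR_LEN;
}
```

The 802.11 frame to be injected would follow immediately after these 11 bytes.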
|
||||
|
||||
Here is an example valid radiotap header defining these three parameters
|
||||
|
||||
|
@@ -3,6 +3,10 @@ started by Ingo Molnar <mingo@redhat.com>, 2001.09.17
|
||||
2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003
|
||||
|
||||
Please send bug reports to Matt Mackall <mpm@selenic.com>
|
||||
and Satyam Sharma <satyam.sharma@gmail.com>
|
||||
|
||||
Introduction:
|
||||
=============
|
||||
|
||||
This module logs kernel printk messages over UDP allowing debugging of
|
||||
problems where disk logging fails and serial consoles are impractical.
|
||||
@@ -13,6 +17,9 @@ the specified interface as soon as possible. While this doesn't allow
|
||||
capture of early kernel panics, it does capture most of the boot
|
||||
process.
|
||||
|
||||
Sender and receiver configuration:
|
||||
==================================
|
||||
|
||||
It takes a string configuration parameter "netconsole" in the
|
||||
following format:
|
||||
|
||||
@@ -34,21 +41,113 @@ Examples:
|
||||
|
||||
insmod netconsole netconsole=@/,@10.0.0.2/
|
||||
|
||||
It also supports logging to multiple remote agents by specifying
|
||||
parameters for the multiple agents separated by semicolons and the
|
||||
complete string enclosed in "quotes", thusly:
|
||||
|
||||
modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/"
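
To make the multi-target syntax concrete, here is a small user-space parser. It is a sketch and not the kernel's parser: `struct nc_target` and `nc_parse()` are hypothetical, and the field order (local port, local IP, device, remote port, remote IP, remote MAC) is an assumption read off the examples above, since the format line itself is not part of this hunk.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for one target; empty strings mean "use default". */
struct nc_target {
    char local_port[8];
    char local_ip[16];
    char dev_name[16];
    char remote_port[8];
    char remote_ip[16];
    char remote_mac[18];
};

/* Copy the text from *pp up to delim (or up to end when delim is '\0')
 * into dst and advance *pp past the delimiter. Returns 0 on success. */
static int nc_take(const char **pp, const char *end, char delim,
                   char *dst, size_t cap)
{
    const char *p = *pp;
    const char *stop = delim ? memchr(p, delim, (size_t)(end - p)) : end;

    if (!stop || (size_t)(stop - p) >= cap)
        return -1;
    memcpy(dst, p, (size_t)(stop - p));
    dst[stop - p] = '\0';
    *pp = delim ? stop + 1 : stop;
    return 0;
}

/* Parse up to max semicolon-separated targets from cfg into out.
 * Returns the number of targets parsed, or -1 on malformed input. */
int nc_parse(const char *cfg, struct nc_target *out, int max)
{
    int n = 0;

    while (*cfg && n < max) {
        const char *end = cfg + strcspn(cfg, ";");
        const char *p = cfg;
        struct nc_target *t = &out[n];

        if (nc_take(&p, end, '@', t->local_port, sizeof t->local_port) ||
            nc_take(&p, end, '/', t->local_ip, sizeof t->local_ip) ||
            nc_take(&p, end, ',', t->dev_name, sizeof t->dev_name) ||
            nc_take(&p, end, '@', t->remote_port, sizeof t->remote_port) ||
            nc_take(&p, end, '/', t->remote_ip, sizeof t->remote_ip) ||
            nc_take(&p, end, '\0', t->remote_mac, sizeof t->remote_mac))
            return -1;
        n++;
        cfg = *end ? end + 1 : end;
    }
    return n;
}
```

Applied to the string from the modprobe example, this yields two targets: one sending to 10.0.0.2 with all defaults, and one sending from eth1 to port 6892 on 10.0.0.3.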
|
||||
|
||||
Built-in netconsole starts immediately after the TCP stack is
|
||||
initialized and attempts to bring up the supplied dev at the supplied
|
||||
address.
|
||||
|
||||
The remote host can run either 'netcat -u -l -p <port>' or syslogd.
|
||||
|
||||
Dynamic reconfiguration:
|
||||
========================
|
||||
|
||||
Dynamic reconfigurability is a useful addition to netconsole that enables
|
||||
remote logging targets to be dynamically added, removed, or have their
|
||||
parameters reconfigured at runtime from a configfs-based userspace interface.
|
||||
[ Note that the parameters of netconsole targets that were specified/created
|
||||
from the boot/module option are not exposed via this interface, and hence
|
||||
cannot be modified dynamically. ]
|
||||
|
||||
To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the
|
||||
netconsole module (or kernel, if netconsole is built-in).
|
||||
|
||||
Some examples follow (where configfs is mounted at the /sys/kernel/config
|
||||
mountpoint).
|
||||
|
||||
To add a remote logging target (target names can be arbitrary):
|
||||
|
||||
cd /sys/kernel/config/netconsole/
|
||||
mkdir target1
|
||||
|
||||
Note that newly created targets have default parameter values (as mentioned
|
||||
above) and are disabled by default -- they must first be enabled by writing
|
||||
"1" to the "enabled" attribute (usually after setting parameters accordingly)
|
||||
as described below.
|
||||
|
||||
To remove a target:
|
||||
|
||||
rmdir /sys/kernel/config/netconsole/othertarget/
|
||||
|
||||
The interface exposes these parameters of a netconsole target to userspace:
|
||||
|
||||
enabled Is this target currently enabled? (read-write)
|
||||
dev_name Local network interface name (read-write)
|
||||
local_port Source UDP port to use (read-write)
|
||||
remote_port Remote agent's UDP port (read-write)
|
||||
local_ip Source IP address to use (read-write)
|
||||
remote_ip Remote agent's IP address (read-write)
|
||||
local_mac Local interface's MAC address (read-only)
|
||||
remote_mac Remote agent's MAC address (read-write)
|
||||
|
||||
The "enabled" attribute is also used to control whether the parameters of
|
||||
a target can be updated or not -- you can modify the parameters of only
|
||||
disabled targets (i.e. if "enabled" is 0).
|
||||
|
||||
To update a target's parameters:
|
||||
|
||||
cat enabled # check if enabled is 1
|
||||
echo 0 > enabled # disable the target (if required)
|
||||
echo eth2 > dev_name # set local interface
|
||||
echo 10.0.0.4 > remote_ip # update some parameter
|
||||
echo cb:a9:87:65:43:21 > remote_mac # update more parameters
|
||||
echo 1 > enabled # enable target again
|
||||
|
||||
You can also update the local interface dynamically. This is especially
|
||||
useful if you want to use interfaces that have newly come up (and may not
|
||||
have existed when netconsole was loaded / initialized).
|
||||
|
||||
Miscellaneous notes:
|
||||
====================
|
||||
|
||||
WARNING: the default target ethernet setting uses the broadcast
|
||||
ethernet address to send packets, which can cause increased load on
|
||||
other systems on the same ethernet segment.
|
||||
|
||||
TIP: some LAN switches may be configured to suppress ethernet broadcasts
|
||||
so it is advised to explicitly specify the remote agents' MAC addresses
|
||||
from the config parameters passed to netconsole.
|
||||
|
||||
TIP: to find out the MAC address of, say, 10.0.0.2, you may try using:
|
||||
|
||||
ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2
|
||||
|
||||
TIP: in case the remote logging agent is on a separate LAN subnet than
|
||||
the sender, it is suggested to try specifying the MAC address of the
|
||||
default gateway (you may use /sbin/route -n to find it out) as the
|
||||
remote MAC address instead.
|
||||
|
||||
NOTE: the network device (eth1 in the above case) can run any kind
|
||||
of other network traffic, netconsole is not intrusive. Netconsole
|
||||
might cause slight delays in other traffic if the volume of kernel
|
||||
messages is high, but should have no other impact.
|
||||
|
||||
NOTE: if you find that the remote logging agent is not receiving or
|
||||
printing all messages from the sender, it is likely that you have set
|
||||
the "console_loglevel" parameter (on the sender) to only send high
|
||||
priority messages to the console. You can change this at runtime using:
|
||||
|
||||
dmesg -n 8
|
||||
|
||||
or by specifying "debug" on the kernel command line at boot, to send
|
||||
all kernel messages to the console. A specific value for this parameter
|
||||
can also be set using the "loglevel" kernel boot option. See the
|
||||
dmesg(8) man page and Documentation/kernel-parameters.txt for details.
|
||||
|
||||
Netconsole was designed to be as instantaneous as possible, to
|
||||
enable the logging of even the most critical kernel bugs. It works
|
||||
from IRQ contexts as well, and does not enable interrupts while
|
||||
|
@@ -73,7 +73,8 @@ dev->hard_start_xmit:
|
||||
has to lock by itself when needed. It is recommended to use a try lock
|
||||
for this and return NETDEV_TX_LOCKED when the spin lock fails.
|
||||
The locking there should also properly protect against
|
||||
set_multicast_list.
|
||||
set_multicast_list. Note that the use of NETIF_F_LLTX is deprecated.
|
||||
Don't use it for new drivers.
|
||||
|
||||
Context: Process with BHs disabled or BH (timer),
|
||||
will be called with interrupts disabled by netconsole.
|
||||
@@ -95,9 +96,13 @@ dev->set_multicast_list:
|
||||
Synchronization: netif_tx_lock spinlock.
|
||||
Context: BHs disabled
|
||||
|
||||
dev->poll:
|
||||
Synchronization: __LINK_STATE_RX_SCHED bit in dev->state. See
|
||||
dev_close code and comments in net/core/dev.c for more info.
|
||||
struct napi_struct synchronization rules
|
||||
========================================
|
||||
napi->poll:
|
||||
Synchronization: NAPI_STATE_SCHED bit in napi->state. Device
|
||||
driver's dev->close method will invoke napi_disable() on
|
||||
all NAPI instances which will do a sleeping poll on the
|
||||
NAPI_STATE_SCHED napi->state bit, waiting for all pending
|
||||
NAPI activity to cease.
|
||||
Context: softirq
|
||||
will be called with interrupts disabled by netconsole.
|
||||
|
||||
|
@@ -1824,6 +1824,162 @@ platforms are moved over to use the flattened-device-tree model.
|
||||
fsl,has-rstcr;
|
||||
};
|
||||
|
||||
|
||||
h) 4xx/Axon EMAC ethernet nodes
|
||||
|
||||
The EMAC ethernet controller in IBM and AMCC 4xx chips, and also
|
||||
the Axon bridge. To operate this needs to interact with a
|
||||
special McMAL DMA controller, and sometimes an RGMII or ZMII
|
||||
interface. In addition to the nodes and properties described
|
||||
below, the node for the OPB bus on which the EMAC sits must have a
|
||||
correct clock-frequency property.
|
||||
|
||||
i) The EMAC node itself
|
||||
|
||||
Required properties:
|
||||
- device_type : "network"
|
||||
|
||||
- compatible : compatible list, contains 2 entries, first is
|
||||
"ibm,emac-CHIP" where CHIP is the host ASIC (440gx,
|
||||
405gp, Axon) and second is either "ibm,emac" or
|
||||
"ibm,emac4". For Axon, thus, we have: "ibm,emac-axon",
|
||||
"ibm,emac4"
|
||||
- interrupts : <interrupt mapping for EMAC IRQ and WOL IRQ>
|
||||
- interrupt-parent : optional, if needed for interrupt mapping
|
||||
- reg : <registers mapping>
|
||||
- local-mac-address : 6 bytes, MAC address
|
||||
- mal-device : phandle of the associated McMAL node
|
||||
- mal-tx-channel : 1 cell, index of the tx channel on McMAL associated
|
||||
with this EMAC
|
||||
- mal-rx-channel : 1 cell, index of the rx channel on McMAL associated
|
||||
with this EMAC
|
||||
- cell-index : 1 cell, hardware index of the EMAC cell on a given
|
||||
ASIC (typically 0x0 and 0x1 for EMAC0 and EMAC1 on
|
||||
each Axon chip)
|
||||
- max-frame-size : 1 cell, maximum frame size supported in bytes
|
||||
- rx-fifo-size : 1 cell, Rx fifo size in bytes for 10 and 100 Mb/sec
|
||||
operations.
|
||||
For Axon, 2048
|
||||
- tx-fifo-size : 1 cell, Tx fifo size in bytes for 10 and 100 Mb/sec
|
||||
operations.
|
||||
For Axon, 2048.
|
||||
- fifo-entry-size : 1 cell, size of a fifo entry (used to calculate
|
||||
thresholds).
|
||||
For Axon, 0x00000010
|
||||
- mal-burst-size : 1 cell, MAL burst size (used to calculate thresholds)
|
||||
in bytes.
|
||||
For Axon, 0x00000100 (I think ...)
|
||||
- phy-mode : string, mode of operations of the PHY interface.
|
||||
Supported values are: "mii", "rmii", "smii", "rgmii",
|
||||
"tbi", "gmii", "rtbi", "sgmii".
|
||||
For Axon on CAB, it is "rgmii"
|
||||
- mdio-device : 1 cell, required iff using shared MDIO registers
|
||||
(440EP). phandle of the EMAC to use to drive the
|
||||
MDIO lines for the PHY used by this EMAC.
|
||||
- zmii-device : 1 cell, required iff connected to a ZMII. phandle of
|
||||
the ZMII device node
|
||||
- zmii-channel : 1 cell, required iff connected to a ZMII. Which ZMII
|
||||
channel or 0xffffffff if ZMII is only used for MDIO.
|
||||
- rgmii-device : 1 cell, required iff connected to an RGMII. phandle
|
||||
of the RGMII device node.
|
||||
For Axon: phandle of plb5/plb4/opb/rgmii
|
||||
- rgmii-channel : 1 cell, required iff connected to an RGMII. Which
|
||||
RGMII channel is used by this EMAC.
|
||||
For Axon: present, whatever value is appropriate for each
|
||||
EMAC, that is the content of the current (bogus) "phy-port"
|
||||
property.
|
||||
|
||||
Recommended properties:
|
||||
- linux,network-index : This is the intended "index" of this
|
||||
network device. This is used by the bootwrapper to interpret
|
||||
MAC addresses passed by the firmware when no information other
|
||||
than indices is available to associate an address with a device.
|
||||
|
||||
Optional properties:
|
||||
- phy-address : 1 cell, optional, MDIO address of the PHY. If absent,
|
||||
a search is performed.
|
||||
- phy-map : 1 cell, optional, bitmap of addresses to probe the PHY
|
||||
for, used if phy-address is absent. bit 0x00000001 is
|
||||
MDIO address 0.
|
||||
For Axon it can be absent, though my current driver
|
||||
doesn't handle phy-address yet so for now, keep
|
||||
0x00ffffff in it.
|
||||
- rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec
|
||||
operations (if absent the value is the same as
|
||||
rx-fifo-size). For Axon, either absent or 2048.
|
||||
- tx-fifo-size-gige : 1 cell, Tx fifo size in bytes for 1000 Mb/sec
|
||||
operations (if absent the value is the same as
|
||||
tx-fifo-size). For Axon, either absent or 2048.
|
||||
- tah-device : 1 cell, optional. If connected to a TAH engine for
|
||||
offload, phandle of the TAH device node.
|
||||
- tah-channel : 1 cell, optional. If appropriate, channel used on the
|
||||
TAH engine.
|
||||
|
||||
Example:
|
||||
|
||||
EMAC0: ethernet@40000800 {
|
||||
linux,network-index = <0>;
|
||||
device_type = "network";
|
||||
compatible = "ibm,emac-440gp", "ibm,emac";
|
||||
interrupt-parent = <&UIC1>;
|
||||
interrupts = <1c 4 1d 4>;
|
||||
reg = <40000800 70>;
|
||||
local-mac-address = [00 04 AC E3 1B 1E];
|
||||
mal-device = <&MAL0>;
|
||||
mal-tx-channel = <0 1>;
|
||||
mal-rx-channel = <0>;
|
||||
cell-index = <0>;
|
||||
max-frame-size = <5dc>;
|
||||
rx-fifo-size = <1000>;
|
||||
tx-fifo-size = <800>;
|
||||
phy-mode = "rmii";
|
||||
phy-map = <00000001>;
|
||||
zmii-device = <&ZMII0>;
|
||||
zmii-channel = <0>;
|
||||
};
|
||||
|
||||
ii) McMAL node
|
||||
|
||||
Required properties:
|
||||
- device_type : "dma-controller"
|
||||
- compatible : compatible list, containing 2 entries, first is
|
||||
"ibm,mcmal-CHIP" where CHIP is the host ASIC (like
|
||||
emac) and the second is either "ibm,mcmal" or
|
||||
"ibm,mcmal2".
|
||||
For Axon, "ibm,mcmal-axon","ibm,mcmal2"
|
||||
- interrupts : <interrupt mapping for the MAL interrupts sources:
|
||||
5 sources: tx_eob, rx_eob, serr, txde, rxde>.
|
||||
For Axon: This is _different_ from the current
|
||||
firmware. We use the "delayed" interrupts for txeob
|
||||
and rxeob. Thus we end up with mapping those 5 MPIC
|
||||
interrupts, all level positive sensitive: 10, 11, 32,
|
||||
33, 34 (in decimal)
|
||||
- dcr-reg : < DCR registers range >
|
||||
- dcr-parent : if needed for dcr-reg
|
||||
- num-tx-chans : 1 cell, number of Tx channels
|
||||
- num-rx-chans : 1 cell, number of Rx channels
|
||||
|
||||
iii) ZMII node
|
||||
|
||||
Required properties:
|
||||
- compatible : compatible list, containing 2 entries, first is
|
||||
"ibm,zmii-CHIP" where CHIP is the host ASIC (like
|
||||
EMAC) and the second is "ibm,zmii".
|
||||
For Axon, there is no ZMII node.
|
||||
- reg : <registers mapping>
|
||||
|
||||
iv) RGMII node
|
||||
|
||||
Required properties:
|
||||
- compatible : compatible list, containing 2 entries, first is
|
||||
"ibm,rgmii-CHIP" where CHIP is the host ASIC (like
|
||||
EMAC) and the second is "ibm,rgmii".
|
||||
For Axon, "ibm,rgmii-axon","ibm,rgmii"
|
||||
- reg : <registers mapping>
|
||||
- revision : as provided by the RGMII new version register if
|
||||
available.
|
||||
For Axon: 0x0000012a
|
||||
|
||||
More devices will be defined as this spec matures.
|
||||
|
||||
VII - Specifying interrupt information for devices
|
||||
|
89
Documentation/rfkill.txt
Normal file
@@ -0,0 +1,89 @@
|
||||
rfkill - RF switch subsystem support
|
||||
====================================
|
||||
|
||||
1 Implementation details
|
||||
2 Driver support
|
||||
3 Userspace support
|
||||
|
||||
===============================================================================
|
||||
1: Implementation details
|
||||
|
||||
The rfkill switch subsystem offers support for keys often found on laptops
|
||||
to enable wireless devices like WiFi and Bluetooth.
|
||||
|
||||
This is done by providing the user 3 possibilities:
|
||||
1 - The rfkill system handles all events; userspace is not aware of events.
|
||||
2 - The rfkill system handles all events; userspace is informed about the events.
|
||||
3 - The rfkill system does not handle events; userspace handles all events.
|
||||
|
||||
The buttons to enable and disable the wireless radios are important in
|
||||
situations where the user is, for example, using a laptop at a location where
|
||||
wireless radios _must_ be disabled (e.g. airplanes).
|
||||
Because of this requirement, userspace support for the keys should not be
|
||||
made mandatory. Because userspace might want to perform some additional smarter
|
||||
tasks when the key is pressed, rfkill still provides userspace the possibility
|
||||
to take over the task to handle the key events.
|
||||
|
||||
The system inside the kernel has been split into 2 separate sections:
|
||||
1 - RFKILL
|
||||
2 - RFKILL_INPUT
|
||||
|
||||
The first option enables rfkill support and will make sure userspace will
|
||||
be notified of any events through the input device. It also creates several
|
||||
sysfs entries which can be used by userspace. See section "Userspace support".
|
||||
|
||||
The second option provides an rfkill input handler. This handler will
|
||||
listen to all rfkill key events and will toggle the radio accordingly.
|
||||
With this option enabled userspace could either do nothing or simply
|
||||
perform monitoring tasks.
|
||||
|
||||
====================================
|
||||
2: Driver support
|
||||
|
||||
To build a driver with rfkill subsystem support, the driver should
|
||||
depend on the Kconfig symbol RFKILL; it should _not_ depend on
|
||||
RFKILL_INPUT.
|
||||
|
||||
Unless key events trigger an interrupt to which the driver listens, polling
|
||||
will be required to determine the key state changes. For this the input
|
||||
layer provides the input-polldev handler.
|
||||
|
||||
A driver should implement a few steps to correctly make use of the
|
||||
rfkill subsystem. First for non-polling drivers:
|
||||
|
||||
- rfkill_allocate()
|
||||
- input_allocate_device()
|
||||
- rfkill_register()
|
||||
- input_register_device()
|
||||
|
||||
For polling drivers:
|
||||
|
||||
- rfkill_allocate()
|
||||
- input_allocate_polled_device()
|
||||
- rfkill_register()
|
||||
- input_register_polled_device()
|
||||
|
||||
When a key event has been detected, the correct event should be
|
||||
sent over the input device which has been registered by the driver.
|
||||
|
||||
====================================
|
||||
3: Userspace support
|
||||
|
||||
For each key an input device will be created which will send out the correct
|
||||
key event when the rfkill key has been pressed.
|
||||
|
||||
The following sysfs entries will be created:
|
||||
|
||||
name: Name assigned by driver to this key (interface or driver name).
|
||||
type: Name of the key type ("wlan", "bluetooth", etc).
|
||||
state: Current state of the key. 1: On, 0: Off.
|
||||
claim: 1: Userspace handles events, 0: Kernel handles events
|
||||
|
||||
Both the "state" and "claim" entries are also writable. For the "state" entry
|
||||
this means that when 1 or 0 is written all radios, not yet in the requested
|
||||
state, will be toggled accordingly.
|
||||
For the "claim" entry writing 1 to it means that the kernel no longer handles
|
||||
key events even though RFKILL_INPUT input was enabled. When "claim" has been
|
||||
set to 1, userspace should make sure that it listens for the input events or
|
||||
check the sysfs "state" entry regularly to correctly perform the required
|
||||
tasks when the rfkill key is pressed.
|
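
The effect of writing to the "state" entry can be modelled as follows. This is a sketch with a hypothetical helper, `rfkill_model_set_state()`, operating on a plain array; it only illustrates the rule that radios already in the requested state are left alone.

```c
#include <stddef.h>

/* Model of writing 1 or 0 to the sysfs "state" entry: every radio that
 * is not yet in the requested state is toggled. radio_on[i] is 1 when
 * radio i is on and 0 when it is off. Returns the number of radios
 * that were toggled. */
int rfkill_model_set_state(int *radio_on, size_t n, int requested)
{
    int toggled = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (radio_on[i] != requested) {
            radio_on[i] = requested;
            toggled++;
        }
    }
    return toggled;
}
```

Writing the same value twice is harmless in this model: the second write toggles nothing.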