d_add() with inode->i_lock already held; common to d_add() and
d_splice_alias(). All ->lookup() instances that end up hashing
the dentry they are given will hash it here.
This almost completes the preparations to parallel lookups
proper - the only remaining bit is taking security_d_instantiate()
past d_rehash() and doing rehashing without dropping ->d_lock.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
First of all, don't bother calling it if inode is NULL -
that makes inode argument unused. Moreover, do it *before*
dropping ->d_lock, not right after that (and don't bother
grabbing ->d_lock in it, of course).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new primitive: d_exact_alias(dentry, inode). If there is an unhashed
dentry with the same name/parent and given inode, rehash, grab and
return it. Otherwise, return NULL. The only caller of d_add_unique()
switched to d_exact_alias() + d_splice_alias().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and use d_add(dn, NULL) in case we need to hash a negative
unhashed rather than using d_rehash() directly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and make mountpoint_last() use it. That makes all
candidates for lookup with parent locked shared go
through lookup_slow().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No need to lock parent just because of ->d_revalidate() on child;
contrary to the stale comment, lookup_dcache() *can* be used without
locking the parent. Result can be moved as soon as we return, of
course, but the same is true for lookup_one_len_unlocked() itself.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Have lookup_fast() return 1 on success and 0 on "need to fall back";
lookup_slow() and follow_managed() return positive (1) on success.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Alexander Duyck says:
====================
Fix differences between IPv4 and IPv6 TCP/UDP checksum calculation
This patch series is meant to address the differences that exist between
IPv4 and IPv6 in terms of checksum calculation. Specifically the IPv6
function csum_ipv6_magic treated length as a value that could be greater
than 64K, while csum_tcpudp_magic was truncating the length at 16 bits.
After looking over the code and giving it some thought I decided it would
be best to update the IPv4 function so that it worked the same way the IPv6
one did. This allows us to get the same results given the same inputs for
both functions. As a result we can use the same processes to reverse the
calculation in the event we need to do something like remove the length of
the pseudo-header checksum.
I also took the opportunity to standardize things so that the parameters
for these functions all use the correct types. IPv4 addresses are __be32,
length should always be __u32, and protocol is a __u8.
With this change in place it corrects an issue with UDP tunnels in which we
were getting a checksum that was off by 1 when performing fragmentation on
inner UDP packets.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible for tunnels to end up generating IP or IPv6 datagrams that
are larger than 64K and expecting to be segmented. As such we need to deal
with length values greater than 64K. In order to accommodate this we need
to update the code to work with a 32b length value instead of a 16b one.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates csum_ipv6_magic so that it correctly recognizes that
protocol is a unsigned 8 bit value.
This will allow us to better understand what limitations may or may not be
present in how we handle the data. For example there are a number of
places that call htonl on the protocol value. This is likely not necessary
and can be replaced with a multiplication by ntohl(1) which will be
converted to a shift by the compiler.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates all instances of csum_tcpudp_magic and
csum_tcpudp_nofold to reflect the types that are usually used as the source
inputs. For example the protocol field is populated based on nexthdr which
is actually an unsigned 8 bit value. The length is usually populated based
on skb->len which is an unsigned integer.
This addresses an issue in which the IPv6 function csum_ipv6_magic was
generating a checksum using the full 32b of skb->len while
csum_tcpudp_magic was only using the lower 16 bits. As a result we could
run into issues when attempting to adjust the checksum as there was no
protocol agnostic way to update it.
With this change the value is still truncated as many architectures use
"(len + proto) << 8", however this truncation only occurs for values
greater than 16776960 in length and as such is unlikely to occur as we stop
the inner headers at ~64K in size.
I did have to make a few minor changes in the arm, mn10300, nios2, and
score versions of the function in order to support these changes as they
were either using things such as an OR to combine the protocol and length,
or were using ntohs to convert the length which would have truncated the
value.
I also updated a few spots in terms of whitespace and type differences for
the addresses. Most of this was just to make sure all of the definitions
were in sync going forward.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an inetdev is destroyed, every address assigned to the interface
is removed. And in this scenerio we do two pointless things which can
be very expensive if the number of assigned interfaces is large:
1) Address promotion. We are deleting all addresses, so there is no
point in doing this.
2) A full nf conntrack table purge for every address. We only need to
do this once, as is already caught by the existing
masq_dev_notifier so masq_inet_event() can skip this.
Reported-by: Solar Designer <solar@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Samuel Ortiz says:
====================
NFC 4.6 pull request
This is a very small one this time, with only 5 patches.
There are a couple of big items that could not be merged/finished
on time.
We have:
- 2 LLCP fixes for a race and a potential OOM.
- 2 cleanups for the pn544 and microread drivers.
- 1 Maintainer addition for the s3fwrn5 driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca says:
====================
MACsec IEEE 802.1AE implementation
MACsec (IEEE 802.1AE [0]) is a protocol that provides security for
wired ethernet LANs. MACsec offers two protection modes:
authentication only, or authenticated encryption.
MACsec defines "secure channels" that allow transmission from one node
to one or more others. Communication on a channel is done over a
succession of "secure associations", that each use a specific key.
Secure associations are identified by their "association number" in
the range 0..3. A secure association is retired when its 32-bit
packet number would wrap, and the same association number can later be
reused with a new key and packet number.
The standard mode of encryption is GCM AES with 128 bits keys,
although an extension allows 256 bits keys [1] (not implemented in
this submission).
When using MACsec, an extra header, called "SecTAG", is added between
the ethernet header and the original payload:
+---------------------------------+----------------+----------------+
| (MACsec ethertype) | TCI_AN | SL |
+---------------------------------+----------------+----------------+
| Packet Number |
+-------------------------------------------------------------------+
| Secure Channel Identifier |
| (optional) |
+-------------------------------------------------------------------+
TCI_AN:
version
end_station
sci_present
scb
encrypted
changed_text
association_number (2 bits)
SL:
short_length (6 bits)
unused (2 bits)
The ethertype for the packet is set to 0x88E5, and the original
ethertype becomes part of the secure payload, which may be encrypted.
The ethernet header and the SecTAG are always transmitted in the
clear, but are integrity-protected.
MACsec supports optional replay protection with a configurable replay
window.
MACsec is designed to be used with the MKA extension to 802.1X (MACsec
Key Agreement protocol) [2], which provides channel attribution and
key distribution to the nodes, but can also be used with static keys
getting fed manually by an administrator.
Optional (not supported yet) features:
- confidentiality offset: in encryption mode, part of the payload may
be left unencrypted.
- choice of cipher suite: GCM AES with 256 bits has been standardised
[1].
Implementation
A netdevice is created on top of a real device for each TX secure
channel, like we do for VLANs. Multiple TX channels can be created on
top of the same underlying device.
Several other approaches were considered for the RX path:
- dev_add_pack: doesn't work, because we want to filter out
unprotected packets
- transparent mode: MACsec would be enabled directly on the real
netdevice. For this, we cannot use a rx_handler directly because
MACsec must be available for underlying devices enslaved in a
bridge or in a bond, so we need a hook directly in
__netif_receive_skb_core. This approach makes it harder to filter
non-encrypted packets on RX without forcing the user to setup some
rules, so the "transparent" mode is not so transparent after all.
It also makes TX more complex than with a dedicated netdevice.
One issue with the proposed implementation is that the qdisc layer for
the real device operates on already encrypted packets.
Netlink API
This is currently a mix of rtnetlink (to create the device and set up
the TX channel) and genl (for RX channels, secure associations and
their keys). genl provides clean demultiplexing of the {TX,RX}{SC,SA}
commands.
Use cases
The normal use case is wired LANs, including veth and slave devices
for bonding/teaming or bridges.
MACsec can also be used on any device that makes a full ethernet
header visible, for example VXLAN.
The VXLAN+MACsec setup would be:
hypervisor | virtual machine
<real_dev>---<VXLAN>---|---<dev>---<macsec_dev>
And the packets would look like this:
| eth | IP | UDP | VXLAN | eth | MACsec | IP | ... | MACsec ICV |
One benefit on this approach to encryption in the cloud is that the
payload is encrypted by the tenant, not by the tunnel provider, thus
the tenant has full control over the keys.
Changes from v1:
- rework netlink API after discussion with Johannes Berg
- nest attributes, rename
- export stats as separate attributes
- add some comments
- misc small fixes (rcu, constants, struct organization)
Changes from RFCv2:
- fix ENCODING_SA param validation
- add parent link to netlink ifdumps
Changes from RFCv1:
- addressed comments from Florian and Paolo + kbuild robot
- also perform post-decrypt handling after crypto callback
- fixed ->dellink behavior
Future plans:
- offload to hardware, on nics that support it
- implement optional features
[0] http://standards.ieee.org/getieee802/download/802.1AE-2006.pdf
[1] http://standards.ieee.org/getieee802/download/802.1AEbn-2011.pdf
[2] http://standards.ieee.org/getieee802/download/802.1X-2010.pdf
[3] RFCv1: http://www.spinics.net/lists/netdev/msg358151.html
[4] RFCv2: http://www.spinics.net/lists/netdev/msg362389.html
[5] v1: http://www.spinics.net/lists/netdev/msg367959.html
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to use the static variable here, pr_info_once is more
concise.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When operating the at803x in SGMII mode, resuming the chip
from power down brings up the copper-side link but leaves
the SGMII link in unconnected state (tested with at8031
attached to gianfar). In effect, this caused a permanent
link loss once the related interface was put down.
This patch ensures that power down handling in supspend()
and resume() is also applied to the SGMII link.
Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer says:
====================
net: bulk free adjustment and two driver use-cases
I've split out the bulk free adjustments, from the bulk alloc patches,
as I want the adjustment to napi_consume_skb be in same kernel cycle
the API was introduced.
Adjustments based on discussion:
Subj: "mlx4: use napi_consume_skb API to get bulk free operations"
http://thread.gmane.org/gmane.linux.network/402503/focus=403386
Patchset based on net-next at commit 3ebeac1d02
V4: more nitpicks from Sergei
V3: spelling fixes from Sergei
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Bulk free of SKBs happen transparently by the API call napi_consume_skb().
The napi budget parameter is needed by napi_consume_skb() to detect
if called from netpoll.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bulk free of SKBs happen transparently by the API call napi_consume_skb().
The napi budget parameter is usually needed by napi_consume_skb()
to detect if called from netpoll. In this patch it has an extra meaning.
For mlx4 driver, the mlx4_en_stop_port() call is done outside
NAPI/softirq context, and cleanup the entire TX ring via
mlx4_en_free_tx_buf(). The code mlx4_en_free_tx_desc() for
freeing SKBs are shared with NAPI calls.
To handle this shared use the zero budget indication is reused,
and handled appropriately in napi_consume_skb(). To reflect this,
variable is called napi_mode for the function call that needed
this distinction.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers reuse/share code paths that free SKBs between NAPI
and non-NAPI calls. Adjust napi_consume_skb to handle this
use-case.
Before, calls from netpoll (w/ IRQs disabled) was handled and
indicated with a budget zero indication. Use the same zero
indication to handle calls not originating from NAPI/softirq.
Simply handled by using dev_consume_skb_any().
This adds an extra branch+call for the netpoll case (checking
in_irq() + irqs_disabled()), but that is okay as this is a slowpath.
Suggested-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For pcie nic, after setting link speed and there is no link driver does not need
to do phy reset until link up.
For some pcie nics, to do this will also reset phy speed down counter and prevent
phy from auto speed down.
This patch fix the issue reported in following link.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1547151
Signed-off-by: Chunhao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Firmware now tells us that the reset is done by passing a magic value
via register. Use it to shorten the wait in case this is supported.
With old firmware, we still wait until the timeout is reached.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently sctp_sendmsg() triggers some calls that will allocate memory
with GFP_ATOMIC even when not necessary. In the case of
sctp_packet_transmit it will allocate a linear skb that will be used to
construct the packet and this may cause sends to fail due to ENOMEM more
often than anticipated specially with big MTUs.
This patch thus allows it to inherit gfp flags from upper calls so that
it can use GFP_KERNEL if it was triggered by a sctp_sendmsg call or
similar. All others, like retransmits or flushes started from BH, are
still allocated using GFP_ATOMIC.
In netperf tests this didn't result in any performance drawbacks when
memory is not too fragmented and made it trigger ENOMEM way less often.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we want to change a flow using netlink, we have to identify it to
be able to perform a lookup. Both the flow key and unique flow ID
(ufid) are valid identifiers, but we always have to specify the flow
key in the netlink message. When both attributes are there, the ufid
is used. The flow key is used to validate the actions provided by
the userland.
This commit allows to use the ufid without having to provide the flow
key, as it is already done in the netlink 'flow get' and 'flow del'
path. The flow key remains mandatory when an action is provided.
Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On AT91 SoCs, the User Register (USRIO) exposes a switch to configure the
"Reduced" or "Traditional" version of the Media Independent Interface
(RMII vs. MII or RGMII vs. GMII).
As on the older EMAC version, on GMAC, this switch is set by default to the
non-reduced type of interface, so use the existing capability and extend it to
GMII as well. We then keep the current logic in the macb_init() function.
The capabilities of sama5d2, sama5d4 and sama5d3 GEM interface are updated in
the macb_config structure to be able to properly enable them with a traditional
interface (GMII or MII).
Reported-by: Romain HENRIET <romain.henriet@l-acoustics.com>
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e5a03bfd87 ("phy: Add an mdio_device structure") removed addr,
bus and dev member of the phy_device structure.
This patch remove the documentation about those members.
Signed-off-by: LABBE Corentin <clabbe.montjoie@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Durrant says:
====================
xen-netback: fix multiple extra info handling
If a frontend passes multiple extra info fragments to netback on the guest
transmit side, because xen-netback does not account for this properly, only
a single ack response will be sent. This will eventually cause processing
of the shared ring to wedge.
This series re-imports the canonical netif.h from Xen, where the ring
protocol documentation has been updated, fixes this issue in xen-netback
and also adds a patch to reduce log spam.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the "prepare for reconnect" pr_info in xenbus.c. It's largely
uninteresting and the states of the frontend and backend can easily be
observed by watching the (o)xenstored log.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code does not currently support a frontend passing multiple extra info
fragments to the backend in a tx request. The xenvif_get_extras() function
handles multiple extra_info fragments but make_tx_response() assumes there
is only ever a single extra info fragment.
This patch modifies xenvif_get_extras() to pass back a count of extra
info fragments, which is then passed to make_tx_response() (after
possibly being stashed in pending_tx_info for deferred responses).
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The canonical netif header (in the Xen source repo) and the Linux variant
have diverged significantly. Recently much documentation has been added to
the canonical header which is highly useful for developers making
modifications to either xen-netfront or xen-netback. This patch therefore
re-imports the canonical header in its entirity.
To maintain compatibility and some style consistency with the old Linux
variant, the header was stripped of its emacs boilerplate, and
post-processed and copied into place with the following commands:
ed -s netif.h << EOF
H
,s/NETTXF_/XEN_NETTXF_/g
,s/NETRXF_/XEN_NETRXF_/g
,s/NETIF_/XEN_NETIF_/g
,s/XEN_XEN_/XEN_/g
,s/netif/xen_netif/g
,s/xen_xen_/xen_/g
,s/^typedef.*$//g
,s/^ /${TAB}/g
w
$
w
EOF
indent --line-length 80 --linux-style netif.h \
-o include/xen/interface/io/netif.h
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds macro NETCONFA_ALL to represent all type of netconf
attributes for IPv4 and IPv6.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>