This is used by TCP listener core, and listener socket shall
not be modified by inet_csk_route_req().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Very soon, TCP stack might call inet_csk_route_req(), which
calls inet_csk_route_req() with an unlocked listener socket,
so we need to make sure ip_route_output_flow() is not trying to
change any field from its socket argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soon, listener socket wont be locked when tcp_openreq_init_rwin()
is called. We need to read socket fields once, as their value
could change under us.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soon, listener socket spinlock will no longer be held,
add const arguments to tcp_v[46]_init_req() to make clear these
functions can not mess socket fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The __netdev_find_adj() helper does not use its first argument, only the
device to find and list to walk through.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating a IP encapsulated tunnel the necessary l2tp module
should be loaded. It already works for UDP encapsulation, it just
doesn't work for direct IP encap.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It should not be necessary to do explicit module loading when
configuring L2TP. Modules should be loaded as needed instead
(as is done already with netlink and other tunnel types).
This patch adds a new module alias type and code to load
the sub module on demand.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide a device_type information for slave network devices created by
DSA, this is useful for user-space application to easily locate/search
for devices of a specific kind.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable patches:
- fix v4.2 SEEK on files over 2 gigs
- Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
- Fix recovery of recalled read delegations
Bugfixes:
- Fix a case where NFSv4 fails to send CLOSE after a server reboot
- Fix sunrpc to wait for connections to complete before retrying
- Fix sunrpc races between transport connect/disconnect and shutdown
- Fix an infinite loop when layoutget fail with BAD_STATEID
- nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
- Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
- Fix layoutreturn/close ordering issues"
* tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS41: make close wait for layoutreturn
NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set
NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget
NFSv4: Recovery of recalled read delegations is broken
NFS: Fix an infinite loop when layoutget fail with BAD_STATEID
NFS: Do cleanup before resetting pageio read/write to mds
SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
SUNRPC: Lock the transport layer on shutdown
nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
SUNRPC: Ensure that we wait for connections to complete before retrying
SUNRPC: drop null test before destroy functions
nfs: fix v4.2 SEEK on files over 2 gigs
SUNRPC: Fix races between socket connection and destroy code
nfs: fix pg_test page count calculation
Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount
The core API has changed so that devices that do not have a global
DMA lkey automatically create an mr, per-PD, and make that lkey
available. The global DMA lkey interface is going away in favor of
the per-PD DMA lkey.
The per-PD DMA lkey is always available. Convert xprtrdma to use the
device's per-PD DMA lkey for regbufs, no matter which memory
registration scheme is in use.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: linux-nfs <linux-nfs@vger.kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
of_find_net_device_by_node() uses class_find_device() internally to
lookup the corresponding network device. class_find_device() returns
a reference to the embedded struct device, with its refcount
incremented.
Add a comment to the definition in net/core/net-sysfs.c indicating the
need to drop this refcount, and fix the DSA code to drop this refcount
when the OF-generated platform data is cleaned up and freed. Also
arrange for the ref to be dropped when handling errors.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current users of of_mdio_find_bus() leak a struct device refcount, as
they fail to clean up the reference obtained inside class_find_device().
Fix the DSA code to properly refcount the returned MDIO bus by:
1. taking a reference on the struct device whenever we assign it to
pd->chip[x].host_dev.
2. dropping the reference when we overwrite the existing reference.
3. dropping the reference when we free the data structure.
4. dropping the initial reference we obtained after setting up the
platform data structure, or on failure.
In step 2 above, where we obtain a new MDIO bus, there is no need to
take a reference on it as we would only have to drop it immediately
after assignment again, iow:
put_device(cd->host_dev); /* drop original assignment ref */
cd->host_dev = get_device(&mdio_bus_switch->dev); /* get our ref */
put_device(&mdio_bus_switch->dev); /* drop of_mdio_find_bus ref */
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now, since we have only 2 values for transaction phase, just use bool.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The enum is going to disappear, use the helpers instead.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer used by drivers, as transaction queue with item destructors
takes care of abort phase internally in switchdev code. So kill it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before it disappears completely, move transaction phase enum under
transaction structure and make attr/obj structures a bit cleaner.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now, the memory allocation in prepare/commit state is done separatelly
in each driver (rocker). Introduce the similar mechanism in generic
switchdev code, in form of queue. That can be used not only for memory
allocations, but also for different items. Abort item destruction
is handled as well.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is temporary, name "trans" will be used for something else and
"trans_ph" will eventually disappear.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman says:
====================
Second Round of IPVS Updates for v4.4
please consider these bug fixes and extensive clean-ups of IPVS
from Eric Biederman for v4.4.
His excellent description of the changes, which is part of an even larger
set of clean-up work, is as follows:
I am gradually working my way through the netfilter stack passing struct
down into the netfilter hooks and from the netfilter hooks and from there
down into the functions that actually care. This removes the need for
netfilter functions to guess how to figure out how to compute which
network namespace they are in and instead provides a simple and reliable
method to do so.
The cleanups stand on their own but this is part of a larger effort to
have routes with an output device that is not in the current network
namespace.
The IPVS code has been a bit more of a challenge than most. Just passing
struct net through to where it is needed did not feel clean to me. The
practical issue is that the ipvs code in most places actually wants
struct netns_ipvs and not struct net.
So as part of this process I have turned the relationship between struct
net and the structs netns_ipvs, ip_vs_conn_param, ip_vs_conn, and
ip_vs_service inside out. I have modified the ipvs functions to take a
struct netns_ipvs not a struct net. The net is code with fewer
conversions from one type of structure to another. I did wind up adding
a struct netns_ipvs parameter to quite a few functions that did not have
it before so I could pass the structure down from the netfilter hooks to
where it is actually needed to avoid guessing.
I have broken up the work in a bunch of small patches so there is at
least a chance and reviewing that each step I took is correct. The
series compiles at each step so bisecting it should not be a problem if
something weird comes up.
The first two changes in this series are actually bug fixes. The first
is a compile fix for a bug in sctp that came in, in the last round of
ipvs changes merged into nf-next. The second fixes an older bug where in
pathological circumstances the wrong network namespace could be used when
a proc file is written to.
The rest of the patchset is a bunch of boring changes getting pushing
struct netns_ipvs (and by extension ipvs->net) where it needs to be.
Either by replacing struct net pointers or adding new struct netns_ipvs
pointers. With a handful of other minor cleanups (like removing
skb_net).
I have decided include the bug fixes in this pull request. Patch one
relates to a bug that was added to nf-next recently and is thus not
applicable to nf . Patch two could arguably be promoted to a fix for v4.3
and stable though it does not appear to be severe enough to warrant that
course of action; let me know if you would like me to reconsider.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently error log messages in ip6_tnl_err are printed at 'warn'
level. This is different to other tunnel types which don't print
any messages. These log messages don't provide any information that
couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".
This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.
Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg says:
====================
Just two small fixes:
* VHT MCS mask array overrun, reported by Dan Carpenter
* reset CQM history to always get a notification, from Sara Sharon
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently error log messages in ip6gre_err are printed at 'warn'
level. This is different to most other tunnel types which don't
print any messages. These log messages don't provide any information
that couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".
This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.
Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.
This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neal suggested to move sk_txhash init into tcp_create_openreq_child(),
called both from IPv4 and IPv6.
This opportunity was missed in commit 58d607d3e5 ("tcp: provide
skb->hash to synack packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.
Fixes: 33f8b9ecdb ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP tunnel config is asymmetric wrt. to the ports used. The source and
destination ports from one direction of the tunnel are not related to the
ports of the other direction. We need to be able to respond to ARP requests
using the correct ports without involving routing.
As the consequence, UDP ports need to be fixed property of the tunnel
interface and cannot be set per route. Remove the ability to set ports per
route. This is still okay to do, as no kernel has been released with these
attributes yet.
Note that the ability to specify source and destination ports is preserved
for other users of the lwtunnel API which don't use routes for tunnel key
specification (like openvswitch).
If in the future we rework ARP handling to allow port specification, the
attributes can be added back.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using ip lwtunnels, the additional data for xmit (basically, the actual
tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
metadata dst. When replying to ARP requests, we need to send the reply to
the same tunnel the request came from. This means we need to construct
proper metadata dst for ARP replies.
We could perform another route lookup to get a dst entry with the correct
lwtstate. However, this won't always ensure that the outgoing tunnel is the
same as the incoming one, and it won't work anyway for IPv4 duplicate
address detection.
The only thing to do is to "reverse" the ip_tunnel_info.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 12fd84f438 ("ipv6: Remove unused neigh argument for
icmp6_dst_alloc() and its callers."), the neigh parameter of ndisc_send_na
and ndisc_send_ns is unused.
CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The genl_notify function has too many arguments for no real reason - all
callers use genl_info to get them anyway. Just pass the genl_info down to
genl_notify.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way. You have to pair store_release
> and load_acquire. Besides, it isn't a particularly good idea to
OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.
> depend on memory barriers embedded in other data structures like the
> above. Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.
But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.
> There's no reason to be overly smart here. This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.
It's not about being overly smart. It's about actually understanding
what's going on with the code. I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.
> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> > }
> > }
> >
> > - if (!nlk->portid) {
> > + if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable. That doesn't change anything. Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.
The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.
However, there is a real bug here that none of these acquire/release
helpers discovered. The two bound tests here used to be a single
one. Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket. So we need to
repeat the portid check in order to maintain consistency.
> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> > return -EPERM;
> >
> > - if (!nlk->portid)
> > + if (!nlk->bound)
>
> Don't we need load_acquire here too? Is this path holding a lock
> which makes that unnecessary?
Ditto.
---8<---
The commit 1f770c0a09 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.
Tejun is right that a barrier is unavoidable. Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.
Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.
I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one. This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.
Fixes: 1f770c0a09 ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Nacked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The callback "create" and "parse" from header_ops are called from
netdev core upper-layer functionality, like af_packet. These callbacks
assumes big endian for addresses and we should not introduce a special
byteordering handling for ieee802154 over af_packet in userspace.
We have an identical issue with setting the mac address which also
assumes big endian byteordering.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Fixes dropped packets in the tx path in case a non-PS station triggers
the tx filter.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This will be needed later when the network namespace guessing is
removed from ip_defrag.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This function adds no real value and it obscures what the code is doing.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The argument is unnecessary and in practice confusing,
and has caused the callers to do all manner of silly things.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This removes the need to use the hack skb_net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This removes the need to compute ipvs with the hack "net_ipvs(skb_net(skb))"
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
With ipvs passed into ip_vs_in_icmp and ip_vs_in_icmp_v6
they no longer need to call the hack that is skb_net.
Additionally ipvs_in_icmp no longer needs to call dev_net(skb->dev)
and can use the ipvs->net instead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Derive ipvs from state->net in the callers of ip_vs_in and pass it
into ip_vs_out. Removing the need to use the hack skb_net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Derive ipvs from state->net in the callers of ip_vs_out and pass it
into ip_vs_out. Removing the need to use the hack skb_net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Stop using the hack skb_net(skb) to compute the network namespace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
With sysctl_cache_bypass now a compile time constant the compiler can
figue out that it can elimiate all of the code that depends on
sysctl_cache_bypass being true.
Also remove the duplicate computation of net previously necessitated
by #ifdef CONFIG_SYSCTL
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This allows two different ways for computing/guessing net to be
removed from ensure_mtu_is_adequate.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Don't use "net_ipvs(skb_net(skb))" as skb_net is a bad hack. Instead
use cp->ipvs and ipvs->net for the net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>