Use the address of ipvs not the address of net when computing the
hash value. This removes an unncessary dependency on struct net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data. So store a pointer to
struct netns_ipvs.
Update the accesses of param->net to access param->ipvs->net instead.
In functions where we are searching for an svc and filtering by net
filter by ipvs instead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
ipvs is what is actually desired so change the parameter and the modify
the callers to pass struct netns_ipvs.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data. So store a pointer to
struct netns_ipvs.
Update the accesses of param->net to access param->ipvs->net instead.
When lookup up struct ip_vs_conn in a hash table replace comparisons
of cp->net with comparisons of cp->ipvs which is possible
now that ipvs is present in ip_vs_conn_param.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data. So store a pointer to
struct netns_ipvs.
Update the accesses of conn->net to access conn->ipvs->net instead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Instead store ipvs in extra2 so that proc_do_defense_mode can easily
find the ipvs that it's value is associated with.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The addition of sysctl_sloppy_sctp in sctp_conn_schedule resulted
in a use of ipvs before it was computed. Hoist the computation of
ipvs earlier to avoid this problem.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Commit 7d82410950 ("virtio: add explicit big-endian support to memory
accessors") accidentally changed the virtio_net header used by
AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.
Since virtio_legacy_is_little_endian() is a very long identifier,
define a vio_le macro and use that throughout the code instead of the
hard-coded 'false' for little-endian.
This restores the ABI to match 4.1 and earlier kernels, and makes my
test program work again.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers might call napi_disable while not holding the napi instance poll_lock.
In those instances, its possible for a race condition to exist between
poll_one_napi and napi_disable. That is to say, poll_one_napi only tests the
NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
the following may happen:
CPU0 CPU1
ndo_tx_timeout napi_poll_dev
napi_disable poll_one_napi
test_and_set_bit (ret 0)
test_bit (ret 1)
reset adapter napi_poll_routine
If the adapter gets a tx timeout without a napi instance scheduled, its possible
for the adapter to think it has exclusive access to the hardware (as the napi
instance is now scheduled via the napi_disable call), while the netpoll code
thinks there is simply work to do. The result is parallel hardware access
leading to corrupt data structures in the driver, and a crash.
Additionaly, there is another, more critical race between netpoll and
napi_disable. The disabled napi state is actually identical to the scheduled
state for a given napi instance. The implication being that, if a napi instance
is disabled, a netconsole instance would see the napi state of the device as
having been scheduled, and poll it, likely while the driver was dong something
requiring exclusive access. In the case above, its fairly clear that not having
the rings in a state ready to be polled will cause any number of crashes.
The fix should be pretty easy. netpoll uses its own bit to indicate that that
the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
We can just gate disabling on that bit as well as the sched bit. That should
prevent netpoll from conducting a napi poll if we convert its set bit to a
test_and_set_bit operation to provide mutual exclusion
Change notes:
V2)
Remove a trailing whtiespace
Resubmit with proper subject prefix
V3)
Clean up spacing nits
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jmaxwell@redhat.com
Tested-by: jmaxwell@redhat.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Jamal suggested to further limit the currently allowed subset of opcodes
that may be used by a direct action return code as the intention is not
to replace the full action engine, but rather to have a minimal set that
can be used in the fast-path on things like ingress for some features
that cls_bpf supports.
Classifiers can, of course, still be chained together that have direct
action mode with those that have a full exec pass. For more complex
scenarios that go beyond this minimal set here, the full tcf_exts_exec()
path must be used.
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The binding to a particular classid was so far always mandatory for
cls_bpf, but it doesn't need to be. Therefore, lift this restriction
as similarly done in other classifiers.
Only a couple of qdiscs make use of class from the tcf_result, others
don't strictly care, so let the user choose his needs (those that read
out class can handle situations where it could be NULL).
An explicit check for tcf_unbind_filter() is also not needed here, as
the previous r->class was 0, so the xchg() will return that and
therefore a callback to the qdisc's unbind_tcf() is skipped.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 43388da42a49 ("cls_bpf: introduce integrated actions") we
have added TCA_BPF_FLAGS. We can also retrieve this information from
the prog, dump it back to user space as well. It's useful in tc when
displaying/dumping filter info.
Also, remove tp from cls_bpf_prog_from_efd(), came in as a conflict
from a rebase and it's unused here (later work may add it along with
a real user).
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly as already the case in bpf_redirect()/skb_do_redirect()
pair, let the stack deal with devs that are !IFF_UP.
dev_forward_skb() as well as dev_queue_xmit() will free the skb
and increment drop counter internally in such cases, so we can
spare the condition in bpf_clone_redirect().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RST packets sent on behalf of TCP connections with TS option (RFC 7323
TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.
A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
ecr 0], length 0
B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
1460,nop,nop,TS val 7264344 ecr 100], length 0
A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
7264344], length 0
B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
ecr 110], length 0
We need to call skb_mstamp_get() to get proper TS val,
derived from skb->skb_mstamp
Note that RFC 1323 was advocating to not send TS option in RST segment,
but RFC 7323 recommends the opposite :
Once TSopt has been successfully negotiated, that is both <SYN> and
<SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
segment for the duration of the connection, and SHOULD be sent in an
<RST> segment (see Section 5.2 for details)
Note this RFC recommends to send TS val = 0, but we believe it is
premature : We do not know if all TCP stacks are properly
handling the receive side :
When an <RST> segment is
received, it MUST NOT be subjected to the PAWS check by verifying an
acceptable value in SEG.TSval, and information from the Timestamps
option MUST NOT be used to update connection state information.
SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.
In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
to decide to send TS val = 0, if it buys something.
Fixes: 7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rt6_info_hash_nhsfn replace the custom hashing over flowi6 that is
using xor with a call to common function get_hash_from_flowi6.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Marvell Egress rx trailer check must be fixed to
correctly detect bad bits in the third byte of the
Eggress trailer as described in the Table 28 of the
88E6060 datasheet.
The current code incorrectly omits to check the third
byte and checks the fourth byte twice.
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.
While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.
In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.
This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.
Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 51360155ec and adapts
fs/userfaultfd.c to use the old version of that function.
It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults. No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next tree
in this 4.4 development cycle, they are:
1) Schedule ICMP traffic to IPVS instances, this introduces a new schedule_icmp
proc knob to enable/disable it. By default is off to retain the old
behaviour. Patchset from Alex Gartrell.
I'm also including what Alex originally said for the record:
"The configuration of ipvs at Facebook is relatively straightforward. All
ipvs instances bgp advertise a set of VIPs and the network prefers the
nearest one or uses ECMP in the event of a tie. For the uninitiated, ECMP
deterministically and statelessly load balances by hashing the packet
(usually a 5-tuple of protocol, saddr, daddr, sport, and dport) and using
that number as an index (basic hash table type logic).
The problem is that ICMP packets (which contain really important
information like whether or not an MTU has been exceeded) will get a
different hash value and may end up at a different ipvs instance. With no
information about where to route these packets, they are dropped, creating
ICMP black holes and breaking Path MTU discovery. Suddenly, my mom's
pictures can't load and I'm fielding midday calls that I want nothing to do
with.
To address this, this patch set introduces the ability to schedule icmp
packets which is gated by a sysctl net.ipv4.vs.schedule_icmp. If set to 0,
the old behavior is maintained -- otherwise ICMP packets are scheduled."
2) Add another proc entry to ignore tunneled packets to avoid routing loops
from IPVS, also from Alex.
3) Fifteen patches from Eric Biederman to:
* Stop passing nf_hook_ops as parameter to the hook and use the state hook
object instead all around the netfilter code, so only the private data
pointer is passed to the registered hook function.
* Now that we've got state->net, propagate the netns pointer to netfilter hook
clients to avoid its computation over and over again. A good example of how
this has been simplified is the former TEE target (now nf_dup infrastructure)
since it has killed the ugly pick_net() function.
There's another round of netns updates from Eric Biederman making the line. To
avoid the patchbomb again to almost all the networking mailing list (that is 84
patches) I'd suggest we send you a pull request with no patches or let me know
if you prefer a better way.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The current behavior of notifying CQM events is inconsistent:
Upon first configuration there is a cqm event with the current
status according to threshold configured, regardless of signal
stability.
When there is reconfiguration no event is sent unless there is
a significant change to the signal level according to the new
configuration.
Since the current reconfiguration behavior might cause missing
CQM events in case the current signal did not change but is on
the other side of the new threshold, fix that by resetting the
stored signal level upon reconfiguration.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows ieee80211_tx_monitor to be used directly for sending 802.11 frames
to all monitor interfaces.
Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Many drivers implement reading current TX power (using either cfg80211
or ieee80211 op) but userspace can't get it using nl80211. Right now the
only way to access it is to call some wext ioctl.
Let's put TX power in interface info reply (callback is wdev specific)
just like we do with current channel.
To be consistent (e.g. NL80211_CMD_SET_WIPHY) let's use mBm as na unit.
Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
It doesn't seem problematic to change the weight for the average
beacon signal from 3 to 4, so use DECLARE_EWMA. This also makes
the code easier to maintain since bugs like the one fixed in the
previous patch can't happen as easily.
With a fix from Avraham Stern to invert the sign since EMWA uses
unsigned values only.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The ifmgd->ave_beacon_signal value cannot be taken as is for
comparisons, it must be divided by since it's represented
like that for better accuracy of the EWMA calculations. This
would lead to invalid driver RSSI events. Fix the used value.
Fixes: 615f7b9bb1 ("mac80211: add driver RSSI threshold events")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
These file aren't really useful:
- if per beacon data is required then you need to use
radiotap or similar anyway, debugfs won't help much
- average beacon signal is reported in station info in
nl80211 and can be looked up with iw
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Implement the basics required for supporting very high throughput
with mesh: include VHT information elements in beacons, probe
responses, and peering action frames, and check for compatible VHT
configurations when peering.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Clear the Channel Center Frequency Segment 2 in VHT operation
IEs to avoid sending non-zero values if the SKB wasn't zeroed
before adding the VHT operation IE.
Signed-off-by: Chun-Yeow Yeoh <yeohchunyeow@gmail.com>
[change commit message a bit - not necessarily just mesh related]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Drivers may be interested in receiving A-MSDU within A-MDPU.
Not all the devices may be able to do so, make it configurable.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The direct probe step before authentication was done mostly for
two reasons:
1) the BSS data could be stale
2) the beacon might not have included all IEs
The concern (1) doesn't really seem to be relevant any more as
we time out BSS information after about 30 seconds, and in fact
the original patch only did the direct probe if the data was
older than the BSS timeout to begin with. This condition got
(likely inadvertedly) removed later though.
Analysing this in more detail shows that since we mostly use
data from the association response, the only real reason for
needing the probe response was that the code validates the WMM
parameters, and those are optional in beacons. As the previous
patches removed that behaviour, we can now remove the direct
probe step entirely.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Advertise the capability to send A-MSDU within A-MPDU
in the AddBA request sent by mac80211. Let the driver
know about the peer's capabilities.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the cfg80211's frame registration api receives wdev, however
mac80211 assumes per device filter configuration and ignores wdev.
Per device filtering is too wasteful, especially for multi-channel
devices.
Introduce new per vif frame registration API and use it for probe
request registrations in ieee80211_mgmt_frame_register()
Also call directly to ieee80211_configure_filter instead of using a work
since it is now allowed to sleep in ieee80211_mgmt_frame_register.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sometimes we are interested in testing TDLS performance in a specific
width setting. Add the ability to disable the wider-band feature, thereby
allowing the TDLS channel width to be controlled by the BSS width.
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Queued frames aren't processed during scan, which results in an inability
to complete the BA session establishment until the scan ends. Since we
can't tx frames until the BA agreement setup is complete, it might result
in a very large latency during scan.
Fix this by allowing to process queued skbs while scanning in HW. This
should be ok since the devices which support hw scan should be able
to handle tx/rx while scanning.
During SW scan, mac80211 drops any txed frames besides probes and NDPs,
so it is still needed to delay processing of the queued frames till the
SW scan is done.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The HT MCS mask has 9 bytes, the VHT one only has 8 streams.
Split the loops to handle this correctly.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sending over AF_PACKET RAW sockets we can sending frames which exceeds
MTU size. To handling it correct we need to change things in AF_PACKET
which knows on RAW sockets an additional FCS is set by hardware or
mac802154 transmit functionality.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch cleanups needed_headroom, needed_tailroom and hard_header_len
fields for wpan and lowpan interfaces.
For wpan interfaces the worst case mac header len should be part of
needed_headroom, currently this is set as hard_header_len, but
hard_header_len should be set to the minimum header length which xmit
call assumes and this is the minimum frame length of 802.15.4.
The hard_header_len value will check inside send callbacl of AF_PACKET
raw sockets.
For lowpan interfaces, if fragmentation isn't needed the skb will
call dev_hard_header for 802154 layer and queue it afterwards. This
happens without new skb allocation, so we need the same headroom and
tailroom lengths like 802154 inside 802154 6lowpan layer. At least we
assume as minimum header length an ipv6 header size.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>