Now when using 'ss' in iproute, kernel would try to load all _diag
modules, which also causes corresponding family and proto modules
to be loaded as well due to module dependencies.
Like after running 'ss', sctp, dccp, af_packet (if it works as a module)
would be loaded.
For example:
$ lsmod|grep sctp
$ ss
$ lsmod|grep sctp
sctp_diag 16384 0
sctp 323584 5 sctp_diag
inet_diag 24576 4 raw_diag,tcp_diag,sctp_diag,udp_diag
libcrc32c 16384 3 nf_conntrack,nf_nat,sctp
As these family and proto modules are loaded unintentionally, it
could cause some problems, like:
- Some debug tools use 'ss' to collect the socket info, which loads all
those diag and family and protocol modules. It's noisy for identifying
issues.
- Users usually expect to drop sctp init packet silently when they
have no sense of sctp protocol instead of sending abort back.
- It wastes resources (especially with multiple netns), and SCTP module
can't be unloaded once it's loaded.
...
In short, it's really inappropriate to have these family and proto
modules loaded unexpectedly when just doing debugging with inet_diag.
This patch is to introduce sock_load_diag_module() where it loads
the _diag module only when it's corresponding family or proto has
been already registered.
Note that we can't just load _diag module without the family or
proto loaded, as some symbols used in _diag module are from the
family or proto module.
v1->v2:
- move inet proto check to inet_diag to avoid a compiling err.
v2->v3:
- define sock_load_diag_module in sock.c and export one symbol
only.
- improve the changelog.
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Phil Sutter <phil@nwl.cc>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 664fcf123a (net: phy: Threaded interrupts allow some simplification)
the phy_interrupt system was changed to use a traditional threaded
interrupt scheme instead of a workqueue approach.
With this change, the phy status check moved into phy_change, which
did not report back to the caller whether or not the interrupt was
handled. This means that, in the case of a shared phy interrupt,
only the first phydev's interrupt registers are checked (since
phy_interrupt() would always return IRQ_HANDLED). This leads to
interrupt storms when it is a secondary device that's actually the
interrupt source.
Signed-off-by: Brad Mouring <brad.mouring@ni.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull "This is the pxa changes for v4.17 cycle" from Robert Jarzmik:
- minor changes for property API
- clock API fix for ULPI driver warning
It exceptionally contains a merge from the mtd tree from Boris
to prevent any merge conflicts in the PXA tree.
* tag 'pxa-for-4.17' of https://github.com/rjarzmik/linux:
ARM: pxa/raumfeld: use PROPERTY_ENTRY_U32() directly
ARM: pxa: ulpi: fix ulpi timeout and slowpath warn
ARM: pxa: cm-x300: remove inline directive
ARM: pxa: fix static checker warning in pxa3xx-ulpi
MAINTAINERS: remove entry for deleted pxa3xx_nand driver
arm: dts: pxa: use reworked NAND controller driver
dt-bindings: mtd: remove pxa3xx NAND controller documentation
mtd: nand: remove useless fields from pxa3xx NAND platform data
mtd: nand: remove deprecated pxa3xx_nand driver
mtd: nand: use Marvell reworked NAND controller driver with all platforms
This adds two new disk triggers for triggering on reads
and writes respectively, named "disk-read" and "disk-write".
The use case comes from working on the D-Link DNS-313 NAS
box. This features an RGB LED for disk activity. with
these two triggers I can couple the green LED to read
activity and the red LED to write activity, which gives
the appropriate user feedback about what is happening
on the disk. When tested it gave exactly the feedback
desired.
The in-kernel interface is simply changed to pass a bool
indicating if the activity is write activity and update
each trigger (and the composite "disk-activity" trigger)
depending on what is passed in.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Pull RCU updates from Paul E. McKenney:
- Miscellaneous fixes, perhaps most notably removing obsolete
code whose only purpose in life was to gather information for
the now-removed RCU debugfs facility. Other notable changes
include removing NO_HZ_FULL_ALL in favor of the nohz_full kernel
boot parameter, minor optimizations for expedited grace periods,
some added tracing, creating an RCU-specific workqueue using Tejun's
new WQ_MEM_RECLAIM flag, and several cleanups to code and comments.
- SRCU cleanups and optimizations.
- Torture-test updates, perhaps most notably the adding of ARMv8
support, but also including numerous cleanups and usability fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ring-buffer code has recusion protection in case tracing ends up tracing
itself, the ring-buffer will detect that it was called at the same context
(normal, softirq, interrupt or NMI), and not continue to record the event.
With the histogram synthetic events, they are called while tracing another
event at the same context. The recusion protection triggers because it
detects tracing at the same context and stops it.
Add ring_buffer_nest_start() and ring_buffer_nest_end() that will notify the
ring buffer that a trace is about to happen within another trace and that it
is intended, and not to trigger the recursion blocking.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
RINGBUF_TYPE_TIME_STAMP is defined but not used, and from what I can
gather was reserved for something like an absolute timestamp feature
for the ring buffer, if not a complete replacement of the current
time_delta scheme.
This code redefines RINGBUF_TYPE_TIME_STAMP to implement absolute time
stamps. Another way to look at it is that it essentially forces
extended time_deltas for all events.
The motivation for doing this is to enable time_deltas that aren't
dependent on previous events in the ring buffer, making it feasible to
use the ring_buffer_event timetamps in a more random-access way, for
purposes other than serial event printing.
To set/reset this mode, use tracing_set_timestamp_abs() from the
previous interface patch.
Link: http://lkml.kernel.org/r/477b362dba1ce7fab9889a1a8e885a62c472f041.1516069914.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When the length of the NTP tick changes significantly, e.g. when an
NTP/PTP application is correcting the initial offset of the clock, a
large value may accumulate in the NTP error before the multiplier
converges to the correct value. It may then take a very long time (hours
or even days) before the error is corrected. This causes the clock to
have an unstable frequency offset, which has a negative impact on the
stability of synchronization with precise time sources (e.g. NTP/PTP
using hardware timestamping or the PTP KVM clock).
Use division to determine the correct multiplier directly from the NTP
tick length and replace the iterative approach. This removes the last
major source of the NTP error. The only remaining source is now limited
resolution of the multiplier, which is corrected by adding 1 to the
multiplier when the system clock is behind the NTP time.
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1520620971-9567-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull PCI fixes from Bjorn Helgaas:
- fix sparc build issue when OF_IRQ not enabled (Guenter Roeck)
- fix enumeration of devices below switches on DesignWare-based
controllers (Koen Vandeputte)
* tag 'pci-v4.16-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: dwc: Fix enumeration end when reaching root subordinate
PCI: Move of_irq_parse_and_map_pci() declaration under OF_IRQ
Some network devices - notably ipvlan slave - are not compatible with
any kind of rx_handler. Currently the hook can be installed but any
configuration (bridge, bond, macsec, ...) is nonfunctional.
This change allocates a priv_flag bit to mark such devices and explicitly
forbid installing a rx_handler if such bit is set. The new bit is used
by ipvlan slave device.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the new PHY wrapper in place we can now handle multiple PHYs.
Remove the code which handles only one generic PHY as this is now
covered (with support for multiple PHYs as well as suspend/resume
support) by the new PHY wrapper.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Tested-by: Neil Armstrong <narmstrong@baylibre.con>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This integrates the PHY wrapper into the core hcd infrastructure.
Multiple PHYs which are part of the HCD's device tree node are now
managed (= powered on/off when needed), by the new usb_phy_roothub code.
Suspend and resume is also supported, however not for
runtime/auto-suspend (which is triggered for example when no devices are
connected to the USB bus). This is needed on some SoCs (for example
Amlogic Meson GXL) because if the PHYs are disabled during auto-suspend
then devices which are plugged in afterwards are not seen by the host.
One example where this is required is the Amlogic GXL and GXM SoCs:
They are using a dwc3 USB controller with up to three ports enabled on
the internal roothub. Each port has it's own PHY which must be enabled
(if one of the PHYs is left disabled then none of the USB ports works at
all).
The new logic works on the Amlogic GXL and GXM SoCs because the dwc3
driver internally creates a xhci-hcd which then registers a HCD which
then triggers our new PHY wrapper.
USB controller drivers can opt out of this by setting
"skip_phy_initialization" in struct usb_hcd to true. This is identical
to how it works for a single USB PHY, so the "multiple PHY" handling is
disabled for drivers that opted out of the management logic of a single
PHY.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Chunfeng Yun <chunfeng.yun@mediatek.com>
Tested-by: Yixun Lan <yixun.lan@amlogic.com>
Tested-by: Neil Armstrong <narmstrong@baylibre.con>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The USB HCD core driver parses the device-tree node for "phys" and
"usb-phys" properties. It also manages the power state of these PHYs
automatically.
However, drivers may opt-out of this behavior by setting "phy" or
"usb_phy" in struct usb_hcd to a non-null value. An example where this
is required is the "Qualcomm USB2 controller", implemented by the
chipidea driver. The hardware requires that the PHY is only powered on
after the "reset completed" event from the controller is received.
A follow-up patch will allow the USB HCD core driver to manage more than
one PHY. Add a new "skip_phy_initialization" bitflag to struct usb_hcd
so drivers can opt-out of any PHY management provided by the USB HCD
core driver.
This also updates the existing drivers so they use the new flag if they
want to opt out of the PHY management provided by the USB HCD core
driver. This means that for these drivers the new "multiple PHY"
handling (which will be added in a follow-up patch) will be disabled as
well.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Acked-by: Peter Chen <peter.chen@nxp.com>
Tested-by: Neil Armstrong <narmstrong@baylibre.con>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After commit fc72d1d54d ("tuntap: XDP transmission"), we can
actually queueing XDP pointers in the pointer ring, so we should
examine the pointer type before freeing the pointer.
Fixes: fc72d1d54d ("tuntap: XDP transmission")
Reported-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
ip6tnl0, ip6gre0) are automatically created when the corresponding
module is loaded.
These tunnels are also automatically created when a new network
namespace is created, at a great cost.
In many cases, netns are used for isolation purposes, and these
extra network devices are a waste of resources. We are using
thousands of netns per host, and hit the netns creation/delete
bottleneck a lot. (Many thanks to Kirill for recent work on this)
Add a new sysctl so that we can opt-out from this automatic creation.
Note that these tunnels are still created for the initial namespace,
to be the least intrusive for typical setups.
Tested:
lpk43:~# cat add_del_unshare.sh
for i in `seq 1 40`
do
(for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
done
wait
lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
lpk43:~# time ./add_del_unshare.sh
real 0m37.521s
user 0m0.886s
sys 7m7.084s
lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
lpk43:~# time ./add_del_unshare.sh
real 0m4.761s
user 0m0.851s
sys 1m8.343s
lpk43:~#
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new security level works so that it creates one PCIe tunnel to the
connected Thunderbolt dock, removing PCIe links downstream of the dock.
This leaves only the internal USB controller visible.
Display Port tunnels are created normally.
While there make sure security sysfs attribute returns "unknown" for any
future security level.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Preboot ACL is a mechanism that allows connecting Thunderbolt devices
boot time in more secure way than the legacy Thunderbolt boot support.
As with the legacy boot option, this also needs to be enabled from the
BIOS before booting is allowed. Difference to the legacy mode is that
the userspace software explicitly adds device UUIDs by sending a special
message to the ICM firmware. Only the devices listed in the boot ACL are
connected automatically during the boot. This works in both "user" and
"secure" security levels.
We implement this in Linux by exposing a new sysfs attribute (boot_acl)
below each Thunderbolt domain. The userspace software can then update
the full list as needed.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
This is needed by the new ICM interface to find xdomains by route string
instead of link and depth.
While there update existing tb_xdomain_find_* functions to use
tb_xdomain_get() instead of open-coding the same.
Signed-off-by: Radion Mirchevsky <radion.mirchevsky@intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
The primary observation is that nohz enter/exit is always from the
current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be
an atomic.
Secondary is that we appear to have 2 nearly identical hooks in the
nohz enter code, set_cpu_sd_state_idle() and
nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into
nohz_balance_{enter,exit}_idle.
Removes an atomic op from both enter and exit paths.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We use a two-step process to configure a filter with RSS spreading. First,
the RSS context is allocated and configured using ETHTOOL_SRSSH; this
returns an identifier (rss_context) which can then be passed to subsequent
invocations of ETHTOOL_SRXCLSRLINS to specify that the offset from the RSS
indirection table lookup should be added to the queue number (ring_cookie)
when delivering the packet. Drivers for devices which can only use the
indirection table entry directly (not add it to a base queue number)
should reject rule insertions combining RSS with a nonzero ring_cookie.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce functions that modify the queue flags and that protect
these modifications with the request queue lock. Except for moving
one wake_up_all() call from inside to outside a critical section,
this patch does not change any functionality.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the definition of queue_flag_clear_unlocked() up and move the
definition of queue_in_flight() down such that all queue flag
manipulation function definitions become contiguous.
This patch does not change any functionality.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Saeed Mahameed says:
====================
mlx5-updates-2018-02-28-2 (IPSec-2)
This series follows our previous one to lay out the foundations for IPSec
in user-space and extend current kernel netdev IPSec support. As noted in
our previous pull request cover letter "mlx5-updates-2018-02-28-1 (IPSec-1)",
the IPSec mechanism will be supported through our flow steering mechanism.
Therefore, we need to change the initialization order. Furthermore, IPsec
is also supported in both egress and ingress. Since our current flow
steering is egress only, we add an empty (only implemented through FPGA
steering ops) egress namespace to handle that case. We also implement
the required flow steering callbacks and logic in our FPGA driver.
We extend the FPGA support for ESN and modifying a xfrm too. Therefore, we
add support for some new FPGA command interface that supports them. The
other required bits are added too. The new features and requirements are
advertised via cap bits.
Last but not least, we revise our driver's accel_esp API. This API will be
shared between our netdev and IB driver, so we need to have all the required
functionality from both worlds.
Regards,
Aviad and Matan
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tegra210 has a hw bug which can cause IP blocks to lock up when ungating a
domain. The reason is that the logic responsible for resetting the memory
built-in self test mode can come up in an undefined state because its
clock is gated by a second level clock gate (SLCG). Work around this by
making sure the logic will get some clock edges by ensuring the relevant
clock is enabled and temporarily override the relevant SLCGs.
Unfortunately for some IP blocks, the control bits for overriding the
SLCGs are not in CAR, but in the IP block itself. This means we need to
map a few extra register banks in the clock code.
Signed-off-by: Peter De Schrijver <pdeschrijver@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Hector Martin <marcan@marcan.st>
Tested-by: Andre Heider <a.heider@gmail.com>
Tested-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
fixup mbist
Pull MIPS fixes from James Hogan:
"A miscellaneous pile of MIPS fixes for 4.16:
- move put_compat_sigset() to evade hardened usercopy warnings (4.16)
- select ARCH_HAVE_PC_{SERIO,PARPORT} for Loongson64 platforms (4.16)
- fix kzalloc() failure handling in ath25 (3.19) and Octeon (4.0)
- fix disabling of IPIs during BMIPS suspend (3.19)"
* tag 'mips_fixes_4.16_4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips:
MIPS: BMIPS: Do not mask IPIs during suspend
MIPS: Loongson64: Select ARCH_MIGHT_HAVE_PC_SERIO
MIPS: Loongson64: Select ARCH_MIGHT_HAVE_PC_PARPORT
signals: Move put_compat_sigset to compat.h to silence hardened usercopy
MIPS: OCTEON: irq: Check for null return on kzalloc allocation
MIPS: ath25: Check for kzalloc allocation failure
We currently have to rely on the GCC large code model for KASLR for
two distinct but related reasons:
- if we enable full randomization, modules will be loaded very far away
from the core kernel, where they are out of range for ADRP instructions,
- even without full randomization, the fact that the 128 MB module region
is now no longer fully reserved for kernel modules means that there is
a very low likelihood that the normal bottom-up allocation of other
vmalloc regions may collide, and use up the range for other things.
Large model code is suboptimal, given that each symbol reference involves
a literal load that goes through the D-cache, reducing cache utilization.
But more importantly, literals are not instructions but part of .text
nonetheless, and hence mapped with executable permissions.
So let's get rid of our dependency on the large model for KASLR, by:
- reducing the full randomization range to 4 GB, thereby ensuring that
ADRP references between modules and the kernel are always in range,
- reduce the spillover range to 4 GB as well, so that we fallback to a
region that is still guaranteed to be in range
- move the randomization window of the core kernel to the middle of the
VMALLOC space
Note that KASAN always uses the module region outside of the vmalloc space,
so keep the kernel close to that if KASAN is enabled.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
When handling an OS descriptor request, one of the first operations is
to zero out the request buffer using the wLength from the setup packet.
There is no bounds checking, so a wLength > 4096 would clobber memory
adjacent to the request buffer. Fix this by taking the min of wLength
and the request buffer length prior to the memset. While at it, define
the buffer length in a header file so that magic numbers don't appear
throughout the code.
When returning data to the host, the data length should be the min of
the wLength and the valid data we have to return. Currently we are
returning wLength, thus requests for a wLength greater than the amount
of data in the OS descriptor buffer would return invalid (albeit zero'd)
data following the valid descriptor data. Fix this by counting the
number of bytes when constructing the data and using this when
determining the length of the request.
Signed-off-by: Chris Dickens <christopher.a.dickens@gmail.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>