Pull operating performance points (OPP) framework changes for v5.2
from Viresh Kumar:
"This pull request contains:
- New helper in OPP core to find best matching frequency for a voltage
value."
* 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
OPP: Introduce dev_pm_opp_find_freq_ceil_by_volt()
Drop the RELEVANT_IFLAG() macro which essentially hasn't been used for
over a decade except in some remnant debug printks that were recently
removed.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Clean up the throttle implementation by dropping the redundant
throttle_req flag which was a remnant from back when there was only a
single read URB.
Also convert the throttled flag to an atomic bit flag.
Signed-off-by: Johan Hovold <johan@kernel.org>
Introduce specification for Geneve decap flow with encapsulation options
and allow creation of rules that are matching on Geneve TLV options.
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Introduce support for Geneve flow specification and allow
the creation of rules that are matching on basic Geneve
protocol fields: VNI, OAM bit, protocol type, options length.
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When in switchdev mode, we would like to treat loopback RoCE
traffic (on eswitch manager) as RDMA and not as regular
Ethernet traffic
In order to enable it we add flow steering rule that forward RoCE
loopback traffic to the HW RoCE filter (by adding allow rule).
In addition we add RoCE address in GID index 0, which will be
set in the RoCE loopback packet.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Flow table supports three types of miss action:
1. Default miss action - go to default miss table according to table.
2. Go to specific table.
3. Switch domain - go to the root table of an alternative steering
table domain.
New table miss action was added - switch_domain.
The next domain for RDMA_RX namespace is the NIC RX domain.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add new flow steering namespace - MLX5_FLOW_NAMESPACE_RDMA_RX.
Flow steering rules in this namespace are used to filter
RDMA traffic.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently mlx5 core stores copy of the PCI device name in a
mlx5_priv structure and uses pr_warn, pr_err helpers.
Get rid of the copy of this name; instead store the parent device
pointer that contains name as well as dma specific parameters.
This also allows to use kernel's well defined dev_warn, dev_err, dev_dbg
device specific print routines.
This is also a preparation patch to access non PCI parent device in
future.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Current ConnectX HW is unable to perform VLAN pop in TX path and VLAN
push on RX path. To workaround that limitation untagged packets will be
tagged with VLAN ID 0x000 (priority tag) and pop/push actions will be
replaced by VLAN re-write actions (which are supported by the HW).
Introduce prio tag mode as a pre-step to controlling the workaround
behavior.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In several places in the kernel we find PCI_DEVID used like this:
PCI_DEVID(dev->bus->number, dev->devfn)
Add a "pci_dev_id(struct pci_dev *dev)" helper to simplify callers.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reset controller changes for v5.2
This adds support for 'acquired'/'released' states to exclusive reset
controls to allow drivers that usually need direct control over their
reset line, to temporarily yield control to another driver, such as a
power domain controller during power transitions. A fix adds missing
headers to linux/reset.h, which caused build errors in linux-next,
discovered by the new lima DRM driver.
* tag 'reset-for-5.2' of git://git.pengutronix.de/pza/linux:
reset: fix linux/reset.h errors
Signed-off-by: Olof Johansson <olof@lixom.net>
i.MX drivers change for 5.2:
- A series from Aisheng to generalize the SCU powerdomain driver
for easier adding new SCU based platforms like imx8qm.
- Add a generic i.MX8 SoC driver for reporting SoC and platform
information.
- Replace explicit polling loop with a call to regmap_read_poll_timeout()
for gpcv2 driver to avoid code repetition.
- Use devm_platform_ioremap_resource() to simplify gpc/gpcv2 driver
code a bit.
- Add general IRQ support for imx-scu driver, so that interrupt of
device like RTC, thermal and watchdog can be handled.
* tag 'imx-drivers-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
soc: imx: Add generic i.MX8 SoC driver
firmware: imx: enable imx scu general irq function
soc: imx: gpcv2: use devm_platform_ioremap_resource() to simplify code
soc: imx: gpc: use devm_platform_ioremap_resource() to simplify code
firmware: imx: scu-pd: decouple the SS information from domain names
firmware: imx: scu-pd: add specifying the base of domain name index support
firmware: imx: scu-pd: use bool to set postfix
soc: imx: gpcv2: Make use of regmap_read_poll_timeout()
Signed-off-by: Olof Johansson <olof@lixom.net>
Some definitions of Inner Cacheability attibutes need to be corrected.
Fixes: 8c828a535e ("irqchip/gicv3-its: Restore all cacheability attributes")
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
A kernel loaded via kexec_load cannot be verified. Thus disable kexec_load
systemcall in kernels which where IPLed securely. Use the IMA mechanism to
do so.
Signed-off-by: Philipp Rudo <prudo@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This modernizes the IXP4xx platform and adds initial Device Tree
Support. We migrate to MULTI_IRQ_HANDLER, bumps the IRQs to
offset 16, converts to SPARSE_IRQ, then we add proper subsystem
drivers in each subsystem for irqchip, GPIO and clocksource and
switch over to using these new drivers.
Next we modernize the NPE and QMGR drivers and push them down
into drivers/soc.
This has been tested on the IXP4xx NSLU2 and the Gateworks
GW2358-4.
* tag 'ixp4xx-for-armsoc' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik: (31 commits)
ARM: dts: Add queue manager and NPE to the IXP4xx DTSI
soc: ixp4xx: qmgr: Add DT probe code
soc: ixp4xx: qmgr: Add DT bindings for IXP4xx qmgr
soc: ixp4xx: npe: Add DT probe code
soc: ixp4xx: Add DT bindings for IXP4xx NPE
soc: ixp4xx: qmgr: Pass resources
soc: ixp4xx: Remove unused functions
soc: ixp4xx: Uninline several functions
soc: ixp4xx: npe: Pass addresses as resources
ARM: ixp4xx: Turn the QMGR into a platform device
ARM: ixp4xx: Turn the NPE into a platform device
ARM: ixp4xx: Move IXP4xx QMGR and NPE headers
ARM: ixp4xx: Move NPE and QMGR to drivers/soc
ARM: dts: Add some initial IXP4xx device trees
ARM: ixp4xx: Add device tree boot support
ARM: ixp4xx: Add DT bindings
gpio: ixp4xx: Add OF probing support
gpio: ixp4xx: Add DT bindings
clocksource/drivers/ixp4xx: Add OF initialization support
clocksource/drivers/ixp4xx: Add DT bindings
...
Signed-off-by: Olof Johansson <olof@lixom.net>
firmware: tegra: Changes for v5.2-rc1
This set of changes includes improvements for Trusted Foundations and
also moves the source files for this support into the standard location
under drivers/firmware.
* tag 'tegra-for-5.2-firmware' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
firmware: Move Trusted Foundations support
ARM: tegra: Sort dependencies alphabetically
ARM: tegra: Add firmware calls required for suspend-resume on Tegra30
ARM: tegra: Always boot CPU in ARM-mode
ARM: tegra: Don't apply CPU erratas in insecure mode
ARM: tegra: Set up L2 cache using Trusted Foundations firmware
ARM: trusted_foundations: Provide information about whether firmware is registered
ARM: trusted_foundations: Make prepare_idle call to take mode argument
ARM: trusted_foundations: Support L2 cache maintenance
Signed-off-by: Olof Johansson <olof@lixom.net>
soc/tegra: Changes for v5.2-rc1
Besides a couple of fixes to better cope with deferred probing, this set
of patches also implements the acquire/release protocol for resets used
during powergate operations. This is necessary to allow these resets to
be temporarily shared with other devices that may also need to control
these resets.
* tag 'tegra-for-5.2-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
soc/tegra: pmc: Move powergate initialisation to probe
soc/tegra: pmc: Remove reset sysfs entries on error
soc/tegra: pmc: Fix reset sources and levels
soc/tegra: pmc: Implement acquire/release for resets
reset: Add acquire/release support for arrays
reset: Add acquired flag to of_reset_control_array_get()
reset: add acquired/released state for exclusive reset controls
Signed-off-by: Olof Johansson <olof@lixom.net>
Currently perf callchain doesn't work well with ORC unwinder
when sampling from trace point. We'll get useless in kernel callchain
like this:
perf 6429 [000] 22.498450: kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
ffffffffbe23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
5651468729c1 [unknown] (/usr/bin/perf)
5651467ee82a main+0x69a (/usr/bin/perf)
7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])
The root cause is that, for trace point events, it doesn't provide a
real snapshot of the hardware registers. Instead perf tries to get
required caller's registers and compose a fake register snapshot
which suppose to contain enough information for start a unwinding.
However without CONFIG_FRAME_POINTER, if failed to get caller's BP as the
frame pointer, so current frame pointer is returned instead. We get
a invalid register combination which confuse the unwinder, and end the
stacktrace early.
So in such case just don't try dump BP, and let the unwinder start
directly when the register is not a real snapshot. Use SP
as the skip mark, unwinder will skip all the frames until it meet
the frame of the trace point caller.
Tested with frame pointer unwinder and ORC unwinder, this makes perf
callchain get the full kernel space stacktrace again like this:
perf 6503 [000] 1567.570191: kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
ffffffffb523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux)
7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
55a22960d9c1 [unknown] (/usr/bin/perf)
55a22958982a main+0x69a (/usr/bin/perf)
7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190422162652.15483-1-kasong@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ep93xx does not have a proper pinctrl driver, but does things
ad-hoc through mach/platform.h, which is also used for setting
up the boards.
To avoid using mach/*.h headers completely, let's move the interfaces
into include/linux/soc/. This is far from great, but gets the job
done here, without the need for a proper pinctrl driver.
Acked-by: Alexander Sverdlin <alexander.sverdlin@gmail.com>
Acked-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Olof Johansson <olof@lixom.net>
We can communicate the clock rate using platform data rather than setting
a flag to use a particular value in the driver, which is cleaner and
avoids the dependency.
No platform in the kernel currently defines the ep93xx keypad device
structure, so this is a rather pointless excercise. Any out of tree
users are probably dead now, but if not, they have to change their
platform code to match the new platform_data structure.
Acked-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Olof Johansson <olof@lixom.net>
The header file is the only thing preventing us from building the
driver in a cross-platform configuration, so move the structure
we are interested in to the global platform_data location
and enable compile testing.
Acked-by: Alexander Sverdlin <alexander.sverdlin@gmail.com>
Acked-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Olof Johansson <olof@lixom.net>
arm64: zynqmp: SoC changes for v5.2
- Add support for ZynqMP fpga manager
- Defer some probes which depends on firmware driver to be ready
- Debugfs fix
* tag 'zynqmp-soc-for-v5.2' of https://github.com/Xilinx/linux-xlnx:
fpga manager: Adding FPGA Manager support for Xilinx zynqmp
dt-bindings: fpga: Add bindings for ZynqMP fpga driver
firmware: xilinx: Add fpga API's
drivers: Defer probe if firmware is not ready
firmware: xilinx: fix debugfs write handler
Signed-off-by: Olof Johansson <olof@lixom.net>
PM changes for am335x and am437x
This series adds support for am437x RTC-only mode in suspend. In the
RTC-only mode suspend, everything is shut down except the RTC. This
makes the power consumption very low for suspend mode.
To support RTC-only mode, we need to export omap_rtc_power_off_program()
from the rtc driver and improve PM code to save and restore the wkup
domain context. As RTC-only mode depends on the device being wired
properly for things like memory, we need to also check for the machine
type before we allow it. We also need to run DDR3 hardware leveling on
resume.
Note that there is a trivial merge conflict between the RTC branch
and these changes where the RTC branch makes tm2bcd() a void function
and the error handling parts can be just dropped.
* tag 'omap-for-v5.2/am4-pm-v2-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
ARM: OMAP2+: sleep43xx: Run EMIF HW leveling on resume path
memory: ti-emif-sram: Add ti_emif_run_hw_leveling for DDR3 hardware leveling
soc: ti: pm33xx: AM437X: Add rtc_only with ddr in self-refresh support
soc: ti: pm33xx: Move the am33xx_push_sram_idle to the top
ARM: OMAP2+: pm33xx: Add support for rtc+ddr in self refresh mode
rtc: OMAP: Add support for rtc-only mode
Signed-off-by: Olof Johansson <olof@lixom.net>
Driver changes for ti-sysc for v5.2 merge window
This series of changes for ti-sysc interconnect target module driver
gets us to the point where we can actually drop legacy platform data
for many devices in favor of device tree data.
To do this, we improve ti-sysc driver not to rely on platform data
callbacks to manage module clocks, and handle more quirks needed for
some devices. Also few minor fixes are needed, but were considered
not needed to be sent separately as they only show up with this series.
Then we drop several thousands of lines of legacy platform data for
omap4, omap5, dra7, am335x and am437x. We drop platform data for mmc,
i2c, gpio and uart devices to start with as those are typically
easily tested on all devices. In case of unexpected issues, we can just
add back the legacy platform data for a single device type if needed.
Finally we add initial support for enabling and disabling some devices
without legacy platform data callbacks. I was planning on sending the
dropping of legacy platform data as a separate series, but already
applied Roger's patch on top and pushed it out.
Note that this series depends on related SoC and is based on those.
* tag 'omap-for-v5.2/ti-sysc-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: (33 commits)
bus: ti-sysc: Add generic enable/disable functions
ARM: OMAP2+: Drop mcspi platform data for omap4
ARM: OMAP2+: Drop uart platform data for dra7
ARM: OMAP2+: Drop gpio platform data for dra7
ARM: OMAP2+: Drop i2c platform data for dra7
ARM: OMAP2+: Drop mmc platform data for dra7
ARM: OMAP2+: Drop uart platform data for omap5
ARM: OMAP2+: Drop gpio platform data for omap5
ARM: OMAP2+: Drop i2c platform data for omap5
ARM: OMAP2+: Drop mmc platform data for omap5
ARM: OMAP2+: Drop uart platform data for am33xx and am43xx
ARM: OMAP2+: Drop gpio platform data for am33xx and am43xx
ARM: OMAP2+: Drop i2c platform data for am33xx and am43xx
ARM: OMAP2+: Drop mmc platform data for am330x and am43xx
ARM: OMAP2+: Drop uart platform data for omap4
ARM: OMAP2+: Drop gpio platform data for omap4
ARM: OMAP2+: Drop i2c platform data for omap4
ARM: OMAP2+: Drop mmc platform data for omap4
Documentation: bus: ti-sysc: fix spelling mistakes "multipe" and "interconnet"
bus: ti-sysc: Detect DMIC for debugging
...
Signed-off-by: Olof Johansson <olof@lixom.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-04-28
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Introduce BPF socket local storage map so that BPF programs can store
private data they associate with a socket (instead of e.g. separate hash
table), from Martin.
2) Add support for bpftool to dump BTF types. This is done through a new
`bpftool btf dump` sub-command, from Andrii.
3) Enable BPF-based flow dissector for skb-less eth_get_headlen() calls which
was currently not supported since skb was used to lookup netns, from Stanislav.
4) Add an opt-in interface for tracepoints to expose a writable context
for attached BPF programs, used here for NBD sockets, from Matt.
5) BPF xadd related arm64 JIT fixes and scalability improvements, from Daniel.
6) Change the skb->protocol for bpf_skb_adjust_room() helper in order to
support tunnels such as sit. Add selftests as well, from Willem.
7) Various smaller misc fixes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous commit, both ipset_nest_start() and ipset_nest_end() are
just aliases for nla_nest_start() and nla_nest_end() so that there is no
need to keep them.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.
Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().
Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:
@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)
@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There seems to be no reason for tls_ops to be defined in netdevice.h
which is included in a lot of places. Don't wrap the struct/enum
declaration in ifdefs, it trickles down unnecessary ifdefs into
driver code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After allowing a bpf prog to
- directly read the skb->sk ptr
- get the fullsock bpf_sock by "bpf_sk_fullsock()"
- get the bpf_tcp_sock by "bpf_tcp_sock()"
- get the listener sock by "bpf_get_listener_sock()"
- avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
into different bpf running context.
this patch is another effort to make bpf's network programming
more intuitive to do (together with memory and performance benefit).
When bpf prog needs to store data for a sk, the current practice is to
define a map with the usual 4-tuples (src/dst ip/port) as the key.
If multiple bpf progs require to store different sk data, multiple maps
have to be defined. Hence, wasting memory to store the duplicated
keys (i.e. 4 tuples here) in each of the bpf map.
[ The smallest key could be the sk pointer itself which requires
some enhancement in the verifier and it is a separate topic. ]
Also, the bpf prog needs to clean up the elem when sk is freed.
Otherwise, the bpf map will become full and un-usable quickly.
The sk-free tracking currently could be done during sk state
transition (e.g. BPF_SOCK_OPS_STATE_CB).
The size of the map needs to be predefined which then usually ended-up
with an over-provisioned map in production. Even the map was re-sizable,
while the sk naturally come and go away already, this potential re-size
operation is arguably redundant if the data can be directly connected
to the sk itself instead of proxy-ing through a bpf map.
This patch introduces sk->sk_bpf_storage to provide local storage space
at sk for bpf prog to use. The space will be allocated when the first bpf
prog has created data for this particular sk.
The design optimizes the bpf prog's lookup (and then optionally followed by
an inline update). bpf_spin_lock should be used if the inline update needs
to be protected.
BPF_MAP_TYPE_SK_STORAGE:
-----------------------
To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can
be created to fit different bpf progs' needs. The map enforces
BTF to allow printing the sk-local-storage during a system-wise
sk dump (e.g. "ss -ta") in the future.
The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete
a "sk-local-storage" data from a particular sk.
Think of the map as a meta-data (or "type") of a "sk-local-storage". This
particular "type" of "sk-local-storage" data can then be stored in any sk.
The main purposes of this map are mostly:
1. Define the size of a "sk-local-storage" type.
2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
map-id, map-btf...etc.)
3. Keep track of all sk's storages of this "type" and clean them up
when the map is freed.
sk->sk_bpf_storage:
------------------
The main lookup/update/delete is done on sk->sk_bpf_storage (which
is a "struct bpf_sk_storage"). When doing a lookup,
the "map" pointer is now used as the "key" to search on the
sk_storage->list. The "map" pointer is actually serving
as the "type" of the "sk-local-storage" that is being
requested.
To allow very fast lookup, it should be as fast as looking up an
array at a stable-offset. At the same time, it is not ideal to
set a hard limit on the number of sk-local-storage "type" that the
system can have. Hence, this patch takes a cache approach.
The last search result from sk_storage->list is cached in
sk_storage->cache[] which is a stable sized array. Each
"sk-local-storage" type has a stable offset to the cache[] array.
In the future, a map's flag could be introduced to do cache
opt-out/enforcement if it became necessary.
The cache size is 16 (i.e. 16 types of "sk-local-storage").
Programs can share map. On the program side, having a few bpf_progs
running in the networking hotpath is already a lot. The bpf_prog
should have already consolidated the existing sock-key-ed map usage
to minimize the map lookup penalty. 16 has enough runway to grow.
All sk-local-storage data will be removed from sk->sk_bpf_storage
during sk destruction.
bpf_sk_storage_get() and bpf_sk_storage_delete():
------------------------------------------------
Instead of using bpf_map_(lookup|update|delete)_elem(),
the bpf prog needs to use the new helper bpf_sk_storage_get() and
bpf_sk_storage_delete(). The verifier can then enforce the
ARG_PTR_TO_SOCKET argument. The bpf_sk_storage_get() also allows to
"create" new elem if one does not exist in the sk. It is done by
the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be
provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together,
it has eliminated the potential use cases for an equivalent
bpf_map_update_elem() API (for bpf_prog) in this patch.
Misc notes:
----------
1. map_get_next_key is not supported. From the userspace syscall
perspective, the map has the socket fd as the key while the map
can be shared by pinned-file or map-id.
Since btf is enforced, the existing "ss" could be enhanced to pretty
print the local-storage.
Supporting a kernel defined btf with 4 tuples as the return key could
be explored later also.
2. The sk->sk_lock cannot be acquired. Atomic operations is used instead.
e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
Please refer to the source code comments for the details in
synchronization cases and considerations.
3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
Benchmark:
---------
Here is the benchmark data collected by turning on
the "kernel.bpf_stats_enabled" sysctl.
Two bpf progs are tested:
One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
sk ptr as the key. (verifier is modified to support sk ptr as the key
That should have shortened the key lookup time.)
Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
each egress skb and then bump the cnt. netperf is used to drive
data with 4096 connected UDP sockets.
BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run)
27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
loaded_at 2019-04-15T13:46:39-0700 uid 0
xlated 344B jited 258B memlock 4096B map_ids 16
btf_id 5
BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
loaded_at 2019-04-15T13:47:54-0700 uid 0
xlated 168B jited 156B memlock 4096B map_ids 17
btf_id 6
Here is a high-level picture on how are the objects organized:
sk
┌──────┐
│ │
│ │
│ │
│*sk_bpf_storage─────▶ bpf_sk_storage
└──────┘ ┌───────┐
┌───────────┤ list │
│ │ │
│ │ │
│ │ │
│ └───────┘
│
│ elem
│ ┌────────┐
├─▶│ snode │
│ ├────────┤
│ │ data │ bpf_map
│ ├────────┤ ┌─────────┐
│ │map_node│◀─┬─────┤ list │
│ └────────┘ │ │ │
│ │ │ │
│ elem │ │ │
│ ┌────────┐ │ └─────────┘
└─▶│ snode │ │
├────────┤ │
bpf_map │ data │ │
┌─────────┐ ├────────┤ │
│ list ├───────▶│map_node│ │
│ │ └────────┘ │
│ │ │
│ │ elem │
└─────────┘ ┌────────┐ │
┌─▶│ snode │ │
│ ├────────┤ │
│ │ data │ │
│ ├────────┤ │
│ │map_node│◀─┘
│ └────────┘
│
│
│ ┌───────┐
sk └──────────│ list │
┌──────┐ │ │
│ │ │ │
│ │ │ │
│ │ └───────┘
│*sk_bpf_storage───────▶bpf_sk_storage
└──────┘
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Mika writes:
thunderbolt: Changes for v5.2 merge window
This improves software connection manager on older Apple systems with
Thunderbolt 1 and 2 controller to support full PCIe daisy chains,
Display Port tunneling and P2P networking. There are also fixes for
potential NULL pointer dereferences at various places in the driver.
* tag 'thunderbolt-for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt: (44 commits)
thunderbolt: Make priority unsigned in struct tb_path
thunderbolt: Start firmware on Titan Ridge Apple systems
thunderbolt: Reword output of tb_dump_hop()
thunderbolt: Make rest of the logging to happen at debug level
thunderbolt: Make __TB_[SW|PORT]_PRINT take const parameters
thunderbolt: Add support for XDomain connections
thunderbolt: Make tb_switch_alloc() return ERR_PTR()
thunderbolt: Add support for DMA tunnels
thunderbolt: Add XDomain UUID exchange support
thunderbolt: Run tb_xdp_handle_request() in system workqueue
thunderbolt: Do not tear down tunnels when driver is unloaded
thunderbolt: Add support for Display Port tunnels
thunderbolt: Rework NFC credits handling
thunderbolt: Generalize port finding routines to support all port types
thunderbolt: Scan only valid NULL adapter ports in hotplug
thunderbolt: Add support for full PCIe daisy chains
thunderbolt: Discover preboot PCIe paths the boot firmware established
thunderbolt: Deactivate all paths before restarting them
thunderbolt: Extend tunnel creation to more than 2 adjacent switches
thunderbolt: Add helper function to iterate from one port to another
...
This is an opt-in interface that allows a tracepoint to provide a safe
buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program.
The size of the buffer must be a compile-time constant, and is checked
before allowing a BPF program to attach to a tracepoint that uses this
feature.
The pointer to this buffer will be the first argument of tracepoints
that opt in; the pointer is valid and can be bpf_probe_read() by both
BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
programs that attach to such a tracepoint, but the buffer to which it
points may only be written by the latter.
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When we create a new lockd client, we want to be able to pass the
correct credential of the process that created the struct nlm_host.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Store the credential of the mount process so that we can determine
information such as the user namespace.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Johannes Berg says:
====================
Various updates, notably:
* extended key ID support (from 802.11-2016)
* per-STA TX power control support
* mac80211 TX performance improvements
* HE (802.11ax) updates
* mesh link probing support
* enhancements of multi-BSSID support (also related to HE)
* OWE userspace processing support
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When converting kuids to AUTH_UNIX creds, etc we will want to use the
same user namespace as the process that created the rpc client.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pull tracing fixes from Steven Rostedt:
"Three tracing fixes:
- Use "nosteal" for ring buffer splice pages
- Memory leak fix in error path of trace_pid_write()
- Fix preempt_enable_no_resched() (use preempt_enable()) in ring
buffer code"
* tag 'trace-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
trace: Fix preempt_enable_no_resched() abuse
tracing: Fix a memory leak by early error exit in trace_pid_write()
tracing: Fix buffer_ref pipe ops
note that conditions surrounding accesses to dname in audit_watch_handle_event()
and audit_mark_handle_event() guarantee that dname won't have been NULL.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This flag was historically used to indicate that a clk is a "basic" type
of clk like a mux, divider, gate, etc. This never turned out to be very
useful though because it was hard to cleanly split "basic" clks from
other clks in a system. This one flag was a way for type introspection
and it just didn't scale. If anything, it was used by the TI clk driver
to indicate that a clk_hw wasn't contained in the SoC specific clk
structure. We can get rid of this define now that TI is finding those
clks a different way.
Cc: Tero Kristo <t-kristo@ti.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: <linux-mips@vger.kernel.org>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Kevin Hilman <khilman@baylibre.com>
Cc: <linux-pwm@vger.kernel.org>
Cc: <linux-amlogic@lists.infradead.org>
Acked-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Note that in fnsotify_move() and fsnotify_link() we are guaranteed
that dentry->d_name won't change during the fsnotify() evaluation
(by having the parent directory locked exclusive), so we don't
need to fetch dentry->d_name.name in the callers. In fsnotify_dirent()
the same stability of dentry->d_name is also true, but it's a bit
more convoluted - there is one callchain (devpts_pty_new() ->
fsnotify_create() -> fsnotify_dirent()) where the parent is _not_
locked, but on devpts ->d_name of everything is unchanging; it
has neither explicit nor implicit renames.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
note that in the second (RENAME_EXCHANGE) call of fsnotify_move() in
vfs_rename() the old_dentry->d_name is guaranteed to be unchanged
throughout the evaluation of fsnotify_move() (by the fact that the
parent directory is locked exclusive), so we don't need to fetch
old_dentry->d_name.name in the caller.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>