Changes in 5.10.71
tty: Fix out-of-bound vmalloc access in imageblit
cpufreq: schedutil: Use kobject release() method to free sugov_tunables
scsi: qla2xxx: Changes to support kdump kernel for NVMe BFS
cpufreq: schedutil: Destroy mutex before kobject_put() frees the memory
usb: cdns3: fix race condition before setting doorbell
ALSA: hda/realtek: Quirks to enable speaker output for Lenovo Legion 7i 15IMHG05, Yoga 7i 14ITL5/15ITL5, and 13s Gen2 laptops.
ACPI: NFIT: Use fallback node id when numa info in NFIT table is incorrect
fs-verity: fix signed integer overflow with i_size near S64_MAX
hwmon: (tmp421) handle I2C errors
hwmon: (w83793) Fix NULL pointer dereference by removing unnecessary structure field
hwmon: (w83792d) Fix NULL pointer dereference by removing unnecessary structure field
hwmon: (w83791d) Fix NULL pointer dereference by removing unnecessary structure field
gpio: pca953x: do not ignore i2c errors
scsi: ufs: Fix illegal offset in UPIU event trace
mac80211: fix use-after-free in CCMP/GCMP RX
x86/kvmclock: Move this_cpu_pvti into kvmclock.h
KVM: x86: Fix stack-out-of-bounds memory access from ioapic_write_indirect()
KVM: x86: nSVM: don't copy virt_ext from vmcb12
KVM: nVMX: Filter out all unsupported controls when eVMCS was activated
KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
media: ir_toy: prevent device from hanging during transmit
RDMA/cma: Do not change route.addr.src_addr.ss_family
drm/amd/display: Pass PCI deviceid into DC
drm/amdgpu: correct initial cp_hqd_quantum for gfx9
ipvs: check that ip_vs_conn_tab_bits is between 8 and 20
bpf: Handle return value of BPF_PROG_TYPE_STRUCT_OPS prog
IB/cma: Do not send IGMP leaves for sendonly Multicast groups
RDMA/cma: Fix listener leak in rdma_cma_listen_on_all() failure
bpf, mips: Validate conditional branch offsets
hwmon: (mlxreg-fan) Return non-zero value when fan current state is enforced from sysfs
mac80211: Fix ieee80211_amsdu_aggregate frag_tail bug
mac80211: limit injected vht mcs/nss in ieee80211_parse_tx_radiotap
mac80211: mesh: fix potentially unaligned access
mac80211-hwsim: fix late beacon hrtimer handling
sctp: break out if skb_header_pointer returns NULL in sctp_rcv_ootb
mptcp: don't return sockets in foreign netns
hwmon: (tmp421) report /PVLD condition as fault
hwmon: (tmp421) fix rounding for negative values
net: enetc: fix the incorrect clearing of IF_MODE bits
net: ipv4: Fix rtnexthop len when RTA_FLOW is present
smsc95xx: fix stalled rx after link change
drm/i915/request: fix early tracepoints
dsa: mv88e6xxx: 6161: Use chip wide MAX MTU
dsa: mv88e6xxx: Fix MTU definition
dsa: mv88e6xxx: Include tagger overhead when setting MTU for DSA and CPU ports
e100: fix length calculation in e100_get_regs_len
e100: fix buffer overrun in e100_get_regs
RDMA/hns: Fix inaccurate prints
bpf: Exempt CAP_BPF from checks against bpf_jit_limit
selftests, bpf: Fix makefile dependencies on libbpf
selftests, bpf: test_lwt_ip_encap: Really disable rp_filter
net: ks8851: fix link error
Revert "block, bfq: honor already-setup queue merges"
scsi: csiostor: Add module softdep on cxgb4
ixgbe: Fix NULL pointer dereference in ixgbe_xdp_setup
net: hns3: do not allow call hns3_nic_net_open repeatedly
net: hns3: keep MAC pause mode when multiple TCs are enabled
net: hns3: fix mixed flag HCLGE_FLAG_MQPRIO_ENABLE and HCLGE_FLAG_DCB_ENABLE
net: hns3: fix show wrong state when add existing uc mac address
net: hns3: fix prototype warning
net: hns3: reconstruct function hns3_self_test
net: hns3: fix always enable rx vlan filter problem after selftest
net: phy: bcm7xxx: Fixed indirect MMD operations
net: sched: flower: protect fl_walk() with rcu
af_unix: fix races in sk_peer_pid and sk_peer_cred accesses
perf/x86/intel: Update event constraints for ICX
hwmon: (pmbus/mp2975) Add missed POUT attribute for page 1 mp2975 controller
nvme: add command id quirk for apple controllers
elf: don't use MAP_FIXED_NOREPLACE for elf interpreter mappings
debugfs: debugfs_create_file_size(): use IS_ERR to check for error
ipack: ipoctal: fix stack information leak
ipack: ipoctal: fix tty registration race
ipack: ipoctal: fix tty-registration error handling
ipack: ipoctal: fix missing allocation-failure check
ipack: ipoctal: fix module reference leak
ext4: fix loff_t overflow in ext4_max_bitmap_size()
ext4: limit the number of blocks in one ADD_RANGE TLV
ext4: fix reserved space counter leakage
ext4: add error checking to ext4_ext_replay_set_iblocks()
ext4: fix potential infinite loop in ext4_dx_readdir()
HID: u2fzero: ignore incomplete packets without data
net: udp: annotate data race around udp_sk(sk)->corkflag
ASoC: dapm: use component prefix when checking widget names
usb: hso: remove the bailout parameter
crypto: ccp - fix resource leaks in ccp_run_aes_gcm_cmd()
HID: betop: fix slab-out-of-bounds Write in betop_probe
netfilter: ipset: Fix oversized kvmalloc() calls
mm: don't allow oversized kvmalloc() calls
HID: usbhid: free raw_report buffers in usbhid_stop
KVM: x86: Handle SRCU initialization failure during page track init
netfilter: conntrack: serialize hash resizes and cleanups
netfilter: nf_tables: Fix oversized kvmalloc() calls
Linux 5.10.71
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I238c3de739c3d4ba0a04a484460356161899f222
[ Upstream commit 356ed64991c6847a0c4f2e8fa3b1133f7a14f1fc ]
Currently if a function ptr in struct_ops has a return value, its
caller will get a random return value from it, because the return
value of related BPF_PROG_TYPE_STRUCT_OPS prog is just dropped.
So add a new flag, BPF_TRAMP_F_RET_FENTRY_RET, that tells the bpf trampoline
to save and return the return value of the struct_ops prog if ret_size of
the function ptr is greater than 0. Also restrict the flag to be
used alone.
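The effect of the flag can be modelled in user space with a toy trampoline. This is only a sketch: toy_trampoline(), toy_ssthresh(), the stale_ret parameter and the bit value used for the flag are all hypothetical and are not taken from the kernel sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit value chosen for this sketch only. */
#define BPF_TRAMP_F_RET_FENTRY_RET (1u << 4)

typedef uint64_t (*toy_prog_fn)(void *ctx);

/*
 * Toy model of a trampoline calling a struct_ops prog. stale_ret stands in
 * for whatever happened to be in the return register beforehand.
 */
uint64_t toy_trampoline(toy_prog_fn prog, void *ctx, unsigned int flags,
			uint64_t stale_ret)
{
	uint64_t prog_ret = prog(ctx);

	/* With the flag set, the prog's return value is saved and returned. */
	if (flags & BPF_TRAMP_F_RET_FENTRY_RET)
		return prog_ret;

	/* Without it, the value is dropped and the caller sees stale data. */
	return stale_ret;
}

/* Hypothetical prog standing in for a func ptr with a non-void return. */
uint64_t toy_ssthresh(void *ctx)
{
	(void)ctx;
	return 42;
}
```

Calling toy_trampoline() with flags set to 0 models the behaviour before the fix, where the caller gets back whatever value was left over rather than the prog's result.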
Fixes: 85d33df357 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210914023351.3664499-1-houtao1@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
This reverts commit bc751d322e as the
kabi can be updated at this point in time.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic38de1d64f2f581383836fe5036b9202a472554a
This reverts commit e21d2b9235
It breaks the abi but we can bring it back later on when the KABI update
happens in a few days.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7a5861c037be3e35973893d8c91eda9133bf8595
Changes in 5.10.28
arm64: mm: correct the inside linear map range during hotplug check
bpf: Fix fexit trampoline.
virtiofs: Fail dax mount if device does not support it
ext4: shrink race window in ext4_should_retry_alloc()
ext4: fix bh ref count on error paths
fs: nfsd: fix kconfig dependency warning for NFSD_V4
rpc: fix NULL dereference on kmalloc failure
iomap: Fix negative assignment to unsigned sis->pages in iomap_swapfile_activate
ASoC: rt1015: fix i2c communication error
ASoC: rt5640: Fix dac- and adc- vol-tlv values being off by a factor of 10
ASoC: rt5651: Fix dac- and adc- vol-tlv values being off by a factor of 10
ASoC: sgtl5000: set DAP_AVC_CTRL register to correct default value on probe
ASoC: es8316: Simplify adc_pga_gain_tlv table
ASoC: soc-core: Prevent warning if no DMI table is present
ASoC: cs42l42: Fix Bitclock polarity inversion
ASoC: cs42l42: Fix channel width support
ASoC: cs42l42: Fix mixer volume control
ASoC: cs42l42: Always wait at least 3ms after reset
NFSD: fix error handling in NFSv4.0 callbacks
kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing
vhost: Fix vhost_vq_reset()
io_uring: fix ->flags races by linked timeouts
scsi: st: Fix a use after free in st_open()
scsi: qla2xxx: Fix broken #endif placement
staging: comedi: cb_pcidas: fix request_irq() warn
staging: comedi: cb_pcidas64: fix request_irq() warn
ASoC: rt5659: Update MCLK rate in set_sysclk()
ASoC: rt711: add snd_soc_component remove callback
thermal/core: Add NULL pointer check before using cooling device stats
locking/ww_mutex: Simplify use_ww_ctx & ww_ctx handling
locking/ww_mutex: Fix acquire/release imbalance in ww_acquire_init()/ww_acquire_fini()
nvmet-tcp: fix kmap leak when data digest in use
io_uring: imply MSG_NOSIGNAL for send[msg]()/recv[msg]() calls
static_call: Align static_call_is_init() patching condition
ext4: do not iput inode under running transaction in ext4_rename()
io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL
net: mvpp2: fix interrupt mask/unmask skip condition
flow_dissector: fix TTL and TOS dissection on IPv4 fragments
can: dev: move driver related infrastructure into separate subdir
net: introduce CAN specific pointer in the struct net_device
can: tcan4x5x: fix max register value
brcmfmac: clear EAP/association status bits on linkdown events
ath11k: add ieee80211_unregister_hw to avoid kernel crash caused by NULL pointer
rtw88: coex: 8821c: correct antenna switch function
netdevsim: dev: Initialize FIB module after debugfs
iwlwifi: pcie: don't disable interrupts for reg_lock
ath10k: hold RCU lock when calling ieee80211_find_sta_by_ifaddr()
net: ethernet: aquantia: Handle error cleanup of start on open
appletalk: Fix skb allocation size in loopback case
net: ipa: remove two unused register definitions
net: ipa: fix register write command validation
net: wan/lmc: unregister device when no matching device is found
net: 9p: advance iov on empty read
bpf: Remove MTU check in __bpf_skb_max_len
ACPI: tables: x86: Reserve memory occupied by ACPI tables
ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
ALSA: usb-audio: Apply sample rate quirk to Logitech Connect
ALSA: hda: Re-add dropped snd_poewr_change_state() calls
ALSA: hda: Add missing sanity checks in PM prepare/complete callbacks
ALSA: hda/realtek: fix a determine_headset_type issue for a Dell AIO
ALSA: hda/realtek: call alc_update_headset_mode() in hp_automute_hook
ALSA: hda/realtek: fix mute/micmute LEDs for HP 640 G8
xtensa: fix uaccess-related livelock in do_page_fault
xtensa: move coprocessor_flush to the .text section
KVM: SVM: load control fields from VMCB12 before checking them
KVM: SVM: ensure that EFER.SVME is set when running nested guest or on nested vmexit
PM: runtime: Fix race getting/putting suppliers at probe
PM: runtime: Fix ordering in pm_runtime_get_suppliers()
tracing: Fix stack trace event size
s390/vdso: copy tod_steering_delta value to vdso_data page
s390/vdso: fix tod_steering_delta type
mm: fix race by making init_zero_pfn() early_initcall
drm/amdkfd: dqm fence memory corruption
drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings()
drm/amdgpu: check alignment on CPU page for bo map
reiserfs: update reiserfs_xattrs_initialized() condition
drm/imx: fix memory leak when fails to init
drm/tegra: dc: Restore coupling of display controllers
drm/tegra: sor: Grab runtime PM reference across reset
vfio/nvlink: Add missing SPAPR_TCE_IOMMU depends
pinctrl: rockchip: fix restore error in resume
extcon: Add stubs for extcon_register_notifier_all() functions
extcon: Fix error handling in extcon_dev_register
firmware: stratix10-svc: reset COMMAND_RECONFIG_FLAG_PARTIAL to 0
usb: dwc3: pci: Enable dis_uX_susphy_quirk for Intel Merrifield
video: hyperv_fb: Fix a double free in hvfb_probe
firewire: nosy: Fix a use-after-free bug in nosy_ioctl()
usbip: vhci_hcd fix shift out-of-bounds in vhci_hub_control()
USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem
usb: musb: Fix suspend with devices connected for a64
usb: xhci-mtk: fix broken streams issue on 0.96 xHCI
cdc-acm: fix BREAK rx code path adding necessary calls
USB: cdc-acm: untangle a circular dependency between callback and softint
USB: cdc-acm: downgrade message to debug
USB: cdc-acm: fix double free on probe failure
USB: cdc-acm: fix use-after-free after probe failure
usb: gadget: udc: amd5536udc_pci fix null-ptr-dereference
usb: dwc2: Fix HPRT0.PrtSusp bit setting for HiKey 960 board.
usb: dwc2: Prevent core suspend when port connection flag is 0
usb: dwc3: qcom: skip interconnect init for ACPI probe
usb: dwc3: gadget: Clear DEP flags after stop transfers in ep disable
soc: qcom-geni-se: Cleanup the code to remove proxy votes
staging: rtl8192e: Fix incorrect source in memcpy()
staging: rtl8192e: Change state information from u16 to u8
driver core: clear deferred probe reason on probe retry
drivers: video: fbcon: fix NULL dereference in fbcon_cursor()
riscv: evaluate put_user() arg before enabling user access
Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing"
bpf: Use NOP_ATOMIC5 instead of emit_nops(&prog, 5) for BPF_TRAMP_F_CALL_ORIG
Linux 5.10.28
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ifdbbeda8de3ee22a7aa3f5d3b10becf0aba1a124
[ Upstream commit e21aa341785c679dd409c8cb71f864c00fe6c463 ]
The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
The synchronize_rcu_tasks() will not wait for such tasks to complete.
In such a case the trampoline image will be freed and when the task
wakes up the return IP will point to freed memory, causing a crash.
Solve this by adding percpu_ref_get/put for the duration of the trampoline
and by separating the lifetime of the trampoline from that of its image.
The "half page" optimization has to be removed, since
first_half->second_half->first_half transition cannot be guaranteed to
complete in deterministic time. Every trampoline update becomes a new image.
The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
call_rcu_tasks. Together they will wait for the original function and
trampoline asm to complete. The trampoline is patched from nop to jmp to skip
fexit progs. They are freed independently from the trampoline. The image with
fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
will wait for both sleepable and non-sleepable progs to complete.
Fixes: fec56f5890 ("bpf: Introduce BPF trampoline")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paul E. McKenney <paulmck@kernel.org> # for RCU
Link: https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Add a vendor hook for bpf so that we can get the memory type and
use it for the memory type check in architecture-dependent
page table setup.
Bug: 181639260
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: Icac325a040fb88c7f6b04b2409029b623bd8515f
Move btf_resolve_size into __btf_resolve_size and
keep btf_resolve_size public with just the first 3
arguments, because the rest of the arguments are not
used by outside callers.
Upcoming changes add more arguments, which
are not useful to outside callers. They will be added
to the __btf_resolve_size function.
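The wrapper pattern described here can be sketched as follows. All names and types (struct toy_type, toy_resolve_size(), toy_resolve_size_internal()) are hypothetical stand-ins, the argument lists are shortened, and the internal helper avoids the kernel's leading double underscore because that prefix is reserved in user-space C.

```c
#include <stddef.h>

/* Hypothetical stand-in for a type description; not the kernel's btf_type. */
struct toy_type {
	unsigned int elem_size;
	unsigned int nelems;	/* 1 for non-array types */
};

/* Internal helper: carries the extra out-parameters only internal callers use. */
static int toy_resolve_size_internal(const struct toy_type *t,
				     unsigned int *type_size,
				     unsigned int *elem_size,
				     unsigned int *total_nelems)
{
	if (!t || !type_size)
		return -1;

	*type_size = t->elem_size * t->nelems;
	if (elem_size)
		*elem_size = t->elem_size;
	if (total_nelems)
		*total_nelems = t->nelems;
	return 0;
}

/* Public entry point: keeps the short signature for outside callers. */
int toy_resolve_size(const struct toy_type *t, unsigned int *type_size)
{
	return toy_resolve_size_internal(t, type_size, NULL, NULL);
}
```

New out-parameters can then be added to the internal helper without touching any outside caller of the public function.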
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200825192124.710397-4-jolsa@kernel.org
Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
  env->allow_ptr_leaks = bpf_allow_ptr_leaks();
  env->bypass_spec_v1 = bpf_bypass_spec_v1();
  env->bypass_spec_v4 = bpf_bypass_spec_v4();
  env->bpf_capable = bpf_capable();
The first three are currently equivalent to perfmon_capable(), since leaking
kernel pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.
'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr conversions,
subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
verifier, run time mitigations in bpf array, and enables indirect variable
access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
by the verifier.
That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigation
applied. Such networking BPF program will not be able to leak kernel pointers
and will not be able to access arbitrary kernel memory.
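The flag split can be illustrated with a small user-space model. This is a sketch only: struct toy_caps, struct toy_verifier_env and the toy_* helpers are hypothetical, and the CAP_SYS_ADMIN fallback in both helpers is an assumption of this sketch that is not stated in the text above.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the capabilities held by the loading task. */
struct toy_caps {
	bool cap_bpf;
	bool cap_perfmon;
	bool cap_sys_admin;
};

/* The four verifier flags named in the text. */
struct toy_verifier_env {
	bool allow_ptr_leaks;
	bool bypass_spec_v1;
	bool bypass_spec_v4;
	bool bpf_capable;
};

/* Assumption of this sketch: CAP_SYS_ADMIN implies both checks. */
static bool toy_perfmon_capable(const struct toy_caps *c)
{
	return c->cap_perfmon || c->cap_sys_admin;
}

static bool toy_bpf_capable(const struct toy_caps *c)
{
	return c->cap_bpf || c->cap_sys_admin;
}

struct toy_verifier_env toy_env_init(const struct toy_caps *c)
{
	struct toy_verifier_env env;

	/* The first three flags follow perfmon_capable(), as stated above. */
	env.allow_ptr_leaks = toy_perfmon_capable(c);
	env.bypass_spec_v1 = toy_perfmon_capable(c);
	env.bypass_spec_v4 = toy_perfmon_capable(c);
	env.bpf_capable = toy_bpf_capable(c);
	return env;
}
```

In this model a loader holding only CAP_BPF gets bpf_capable set but none of the three bypass flags, which matches the networking case described above.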
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
Overlapping header include additions in macsec.c
A bug fix in 'net' overlapping with the removal of 'version'
string in ena_netdev.c
Overlapping test additions in selftests Makefile
Overlapping PCI ID table adjustments in iwlwifi driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
The current always-succeed behavior in bpf_struct_ops_map_delete_elem()
is not ideal for userspace tools. It can be improved to return a proper
error value.
If it is in TOBEFREE, it means unregistration has already been requested
but is still in progress, waiting for the subsystem to drop
the refcnt to zero, so -EINPROGRESS.
If it is INIT, it means the struct_ops has not been registered yet,
so -ENOENT.
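The mapping from state to return value can be sketched as below. The enum, the struct and toy_delete_elem() are hypothetical names for this illustration, and the sketch uses a plain assignment for the state change, so it makes no claim about how the kernel handles concurrent callers.

```c
#include <errno.h>

/* Hypothetical mirror of the three states named in the text. */
enum toy_state {
	TOY_STATE_INIT,
	TOY_STATE_INUSE,
	TOY_STATE_TOBEFREE,
};

struct toy_struct_ops_map {
	enum toy_state state;
};

/* Returns 0 on success or a negative errno describing why nothing was done. */
int toy_delete_elem(struct toy_struct_ops_map *m)
{
	switch (m->state) {
	case TOY_STATE_INUSE:
		/* Unregistration from the subsystem would be triggered here. */
		m->state = TOY_STATE_TOBEFREE;
		return 0;
	case TOY_STATE_TOBEFREE:
		/* Already unregistered, still waiting for the refcnt to drop. */
		return -EINPROGRESS;
	case TOY_STATE_INIT:
		/* Never registered, so there is nothing to delete. */
		return -ENOENT;
	}
	return -EINVAL;
}
```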
Fixes: 85d33df357 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305013447.535326-1-kafai@fb.com
As we need to introduce a third type of attachment for trampolines, the
flattened signature of arch_prepare_bpf_trampoline gets even more
complicated.
Refactor the prog and count argument to arch_prepare_bpf_trampoline to
use bpf_tramp_progs to simplify the addition and accounting for new
attachment types.
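The idea of grouping programs per attachment type can be sketched like this. The toy_* names, the array size and the helper functions are hypothetical; only the concept of one bpf_tramp_progs-style container per attachment type comes from the text above.

```c
#include <stddef.h>

/* Hypothetical attachment kinds; a third kind is what motivates the refactor. */
enum toy_tramp_prog_type {
	TOY_TRAMP_FENTRY,
	TOY_TRAMP_FEXIT,
	TOY_TRAMP_MODIFY_RETURN,
	TOY_TRAMP_MAX,
};

#define TOY_MAX_TRAMP_PROGS 8

struct toy_prog {
	int id;
};

/* One container per attachment kind instead of a (progs, count) pair each. */
struct toy_tramp_progs {
	const struct toy_prog *progs[TOY_MAX_TRAMP_PROGS];
	int nr_progs;
};

/* tprogs points at an array of TOY_TRAMP_MAX containers. */
int toy_tramp_add(struct toy_tramp_progs *tprogs,
		  enum toy_tramp_prog_type kind, const struct toy_prog *prog)
{
	struct toy_tramp_progs *slot;

	if ((int)kind < 0 || (int)kind >= TOY_TRAMP_MAX)
		return -1;

	slot = &tprogs[kind];
	if (slot->nr_progs >= TOY_MAX_TRAMP_PROGS)
		return -1;

	slot->progs[slot->nr_progs++] = prog;
	return 0;
}

/* Accounting no longer needs one counter argument per attachment kind. */
int toy_tramp_total(const struct toy_tramp_progs *tprogs)
{
	int total = 0;

	for (int kind = 0; kind < TOY_TRAMP_MAX; kind++)
		total += tprogs[kind].nr_progs;
	return total;
}
```

With this layout, a function that consumes the programs takes a single array argument, and adding a new attachment type only extends the enum.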
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-2-kpsingh@chromium.org
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
        int stuff;
        struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kinds of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
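As an illustration of why allocations are unaffected, here is a minimal user-space sketch of allocating a structure that ends in a flexible array member. The count field, the contents of struct boo and the foo_alloc() helper are additions made for this example and are not part of the patch.

```c
#include <stdint.h>
#include <stdlib.h>

struct boo {
	int value;
};

struct foo {
	int stuff;
	size_t count;		/* added for this sketch: elements in array[] */
	struct boo array[];	/* flexible array member, must come last */
};

/* Allocate a zeroed struct foo with room for count trailing elements. */
struct foo *foo_alloc(size_t count)
{
	struct foo *f;

	/* Reject sizes whose byte count would overflow size_t. */
	if (count > (SIZE_MAX - sizeof(*f)) / sizeof(f->array[0]))
		return NULL;

	/* sizeof(*f) covers the header only; the array is added explicitly. */
	f = calloc(1, sizeof(*f) + count * sizeof(f->array[0]));
	if (f)
		f->count = count;
	return f;
}
```

The size expression is the same one a zero-length array would use, because in both cases sizeof the structure excludes the trailing elements.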
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor
Instead of using a locally defined "struct bpf_verifier_log log = {}",
btf_struct_ops_init() should reuse the "log" from its calling
function "btf_parse_vmlinux()". This should also resolve the
frame-size-too-large compiler warning on some architectures.
Fixes: 27ae7997a6 ("bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200127175145.1154438-1-kafai@fb.com
Instead of using bpf_struct_ops_map_lookup_elem() which is
not implemented, bpf_struct_ops_map_seq_show_elem() should
also use bpf_struct_ops_map_sys_lookup_elem() which does
an inplace update to the value. The change allocates
a value to pass to bpf_struct_ops_map_sys_lookup_elem().
[root@arch-fb-vm1 bpf]# cat /sys/fs/bpf/dctcp
{{{1}},BPF_STRUCT_OPS_STATE_INUSE,{{00000000df93eebc,00000000df93eebc},0,2, ...
Fixes: 85d33df357 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200114072647.3188298-1-kafai@fb.com
The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value
is a kernel struct with its func ptr implemented in bpf prog.
This new map is the interface to register/unregister/introspect
a bpf implemented kernel struct.
The kernel struct is actually embedded inside another new struct
(called the "value" struct in the code). For example,
"struct tcp_congestion_ops" is embedded in:
struct bpf_struct_ops_tcp_congestion_ops {
        refcount_t refcnt;
        enum bpf_struct_ops_state state;
        struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
}
The map value is "struct bpf_struct_ops_tcp_congestion_ops".
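A macro that generates such a "value" struct could look roughly like the sketch below. Everything here is hypothetical: TOY_STRUCT_OPS_VALUE, struct toy_cong_ops, the plain int refcnt and toy_value_of() are stand-ins for illustration and do not reproduce the kernel's actual macro or types.

```c
#include <stddef.h>

enum toy_struct_ops_state {
	TOY_STATE_INIT,
	TOY_STATE_INUSE,
	TOY_STATE_TOBEFREE,
};

/* Hypothetical subsystem struct standing in for struct tcp_congestion_ops. */
struct toy_cong_ops {
	void (*init)(void *sk);
	char name[16];
};

/* Generate the wrapper ("value") struct for a given subsystem struct. */
#define TOY_STRUCT_OPS_VALUE(_name)			\
struct toy_struct_ops_##_name {				\
	int refcnt;					\
	enum toy_struct_ops_state state;		\
	struct _name data;				\
}

TOY_STRUCT_OPS_VALUE(toy_cong_ops);

/* Recover the wrapper from a pointer to the embedded subsystem struct. */
struct toy_struct_ops_toy_cong_ops *toy_value_of(struct toy_cong_ops *ops)
{
	return (struct toy_struct_ops_toy_cong_ops *)
		((char *)ops - offsetof(struct toy_struct_ops_toy_cong_ops, data));
}
```

Because the subsystem struct is embedded rather than pointed to, code holding only the inner pointer can still reach the refcnt and state in the wrapper.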
The "bpftool map dump" will then be able to show the
state ("inuse"/"tobefree") and the subsystem's refcnt (e.g. the
number of tcp_sock in the tcp_congestion_ops case). This "value" struct
is created automatically by a macro. Having a separate "value" struct
will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
"void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
initialization work before registering the struct_ops to the kernel
subsystem). libbpf will take care of finding and populating the
"struct bpf_struct_ops_XYZ" from "struct XYZ".
Register a struct_ops to a kernel subsystem:
1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
running kernel.
Instead of reusing the attr->btf_value_type_id,
btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
used as the "user" btf which could store other useful sysadmin/debug
info that may be introduced in the future,
e.g. creation-date/compiler-details/map-creator...etc.
3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
in the running kernel btf. Populate the value of this object.
The function ptr should be populated with the prog fds.
4. Call BPF_MAP_UPDATE with the object created in (3) as
the map value. The key is always "0".
During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
args as an array of u64 is generated. BPF_MAP_UPDATE also allows
the specific struct_ops to do some final checks in "st_ops->init_member()"
(e.g. ensure all mandatory func ptrs are implemented).
If everything looks good, it will register this kernel struct
to the kernel subsystem. The map will not allow further update
from this point.
Unregister a struct_ops from the kernel subsystem:
BPF_MAP_DELETE with key "0".
Introspect a struct_ops:
BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will
have the prog _id_ populated as the func ptr.
The map value state (enum bpf_struct_ops_state) will transit from:
INIT (map created) =>
INUSE (map updated, i.e. reg) =>
TOBEFREE (map value deleted, i.e. unreg)
The kernel subsystem needs to call bpf_struct_ops_get() and
bpf_struct_ops_put() to manage the "refcnt" in the
"struct bpf_struct_ops_XYZ". This patch uses a separate refcnt
for the purpose of tracking the subsystem usage. Another approach
is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
the subsystem's usage by doing map->refcnt - map->usercnt to filter out
the map-fd/pinned-map usage. However, that will also tie down the
future semantics of map->refcnt and map->usercnt.
The very first subsystem's refcnt (during reg()) holds one
count to map->refcnt. When the very last subsystem's refcnt
is gone, it will also release the map->refcnt. All bpf_prog will be
freed when the map->refcnt reaches 0 (i.e. during map_free()).
Here is what the bpftool map command output looks like:
[root@arch-fb-vm1 bpf]# bpftool map show
6: struct_ops name dctcp flags 0x0
key 4B value 256B max_entries 1 memlock 4096B
btf_id 6
[root@arch-fb-vm1 bpf]# bpftool map dump id 6
[{
        "value": {
            "refcnt": {
                "refs": {
                    "counter": 1
                }
            },
            "state": 1,
            "data": {
                "list": {
                    "next": 0,
                    "prev": 0
                },
                "key": 0,
                "flags": 2,
                "init": 24,
                "release": 0,
                "ssthresh": 25,
                "cong_avoid": 30,
                "set_state": 27,
                "cwnd_event": 28,
                "in_ack_event": 26,
                "undo_cwnd": 29,
                "pkts_acked": 0,
                "min_tso_segs": 0,
                "sndbuf_expand": 0,
                "cong_control": 0,
                "get_info": 0,
                "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                ],
                "owner": 0
            }
        }
    }
]
Misc Notes:
* bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
It does an inplace update on "*value" instead of returning a pointer
to syscall.c. Otherwise, it needs a separate copy of "zero" value
for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
* The bpf_struct_ops_map_delete_elem() is also called without
preempt_disable() from map_delete_elem(). It is because
the "->unreg()" may require a sleepable context, e.g.
the "tcp_unregister_congestion_control()".
* "const" is added to some of the existing "struct btf_func_model *"
function arg to avoid a compiler warning caused by this patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com
This patch allows the kernel's struct ops (i.e. func ptr) to be
implemented in BPF. The first use case in this series is the
"struct tcp_congestion_ops" which will be introduced in a
later patch.
This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
func ptr of a kernel struct. The attr->attach_btf_id is the btf id
of a kernel struct. The attr->expected_attach_type is the member
"index" of that kernel struct. The first member of a struct starts
with member index 0. That will avoid ambiguity when a kernel struct
has multiple func ptrs with the same func signature.
For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
to implement the "init" func ptr of the "struct tcp_congestion_ops".
The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
of the _running_ kernel. The attr->expected_attach_type is 3.
The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
by arch_prepare_bpf_trampoline that will be done in the next
patch when introducing BPF_MAP_TYPE_STRUCT_OPS.
"struct bpf_struct_ops" is introduced as a common interface for the kernel
struct that supports BPF_PROG_TYPE_STRUCT_OPS prog. The supporting kernel
struct will need to implement an instance of the "struct bpf_struct_ops".
The supporting kernel struct also needs to implement a bpf_verifier_ops.
During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
bpf_verifier_ops by searching the attr->attach_btf_id.
A new "btf_struct_access" is also added to the bpf_verifier_ops such
that the supporting kernel struct can optionally provide its own specific
check on accessing the func arg (e.g. provide limited write access).
After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
to initialize some values (e.g. the btf id of the supporting kernel
struct) and it can only be done once the btf_vmlinux is available.
The R0 check at BPF_EXIT is skipped for the BPF_PROG_TYPE_STRUCT_OPS prog
if the return type of the prog->aux->attach_func_proto is "void".
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003503.3855825-1-kafai@fb.com