Convert some existing tests that attach to tracepoints to use
bpf_program__attach_tracepoint API instead.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Use new bpf_program__attach_perf_event() in test previously relying on
direct ioctl manipulations.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add a wrapper utilizing bpf_link "infrastructure" to allow attaching BPF
programs to raw tracepoints.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow attaching BPF programs to kernel tracepoint BPF hooks specified by
category and name.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add ability to attach to kernel and user probes and retprobes.
Implementation depends on perf event support for kprobes/uprobes.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
bpf_program__attach_perf_event allows to attach BPF program to existing
perf event hook, providing most generic and most low-level way to attach BPF
programs. It returns struct bpf_link, which should be passed to
bpf_link__destroy to detach and free resources, associated with a link.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
bpf_link is an abstraction of an association of a BPF program and one of
many possible BPF attachment points (hooks). This allows to have uniform
interface for detaching BPF programs regardless of the nature of link
and how it was created. Details of creation and setting up of a specific
bpf_link is handled by corresponding attachment methods
(bpf_program__attach_xxx) added in subsequent commits. Once successfully
created, bpf_link has to be eventually destroyed with
bpf_link__destroy(), at which point BPF program is disassociated from
a hook and all the relevant resources are freed.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
It's often inconvenient to switch sign of error when passing it into
libbpf_strerror_r. It's better for it to handle that automatically.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-07-03
The following pull-request contains BPF updates for your *net-next* tree.
There is a minor merge conflict in mlx5 due to 8960b38932 ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but resolution seems not that bad ... getting current
bpf-next out now before there's coming more on mlx5. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:
** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:
<<<<<<< HEAD
static int mlx5e_open_cq(struct mlx5e_channel *c,
struct dim_cq_moder moder,
struct mlx5e_cq_param *param,
struct mlx5e_cq *cq)
=======
int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
>>>>>>> e5a3e259ef
Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...
drivers/net/ethernet/mellanox/mlx5/core/en.h +977
... and in mlx5e_open_xsk() ...
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64
... needs the same rename from net_dim_cq_moder into dim_cq_moder.
** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:
<<<<<<< HEAD
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
struct dim_cq_moder icocq_moder = {0, 0};
struct net_device *netdev = priv->netdev;
struct mlx5e_channel *c;
unsigned int irq;
=======
struct net_dim_cq_moder icocq_moder = {0, 0};
>>>>>>> e5a3e259ef
Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.
Let me know if you run into any issues. Anyway, the main changes are:
1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.
2) Addition of two new per-cgroup BPF hooks for getsockopt and
setsockopt along with a new sockopt program type which allows more
fine-grained pass/reject settings for containers. Also add a sock_ops
callback that can be selectively enabled on a per-socket basis and is
executed for every RTT to help tracking TCP statistics, both features
from Stanislav.
3) Follow-up fix from loops in precision tracking which was not propagating
precision marks and as a result verifier assumed that some branches were
not taken and therefore wrongly removed as dead code, from Alexei.
4) Fix BPF cgroup release synchronization race which could lead to a
double-free if a leaf's cgroup_bpf object is released and a new BPF
program is attached to the one of ancestor cgroups in parallel, from Roman.
5) Support for bulking XDP_TX on veth devices which improves performance
in some cases by around 9%, from Toshiaki.
6) Allow for lookups into BPF devmap and improve feedback when calling into
bpf_redirect_map() as lookup is now performed right away in the helper
itself, from Toke.
7) Add support for fq's Earliest Departure Time to the Host Bandwidth
Manager (HBM) sample BPF program, from Lawrence.
8) Various cleanups and minor fixes all over the place from many others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, gratuitous ARP/ND packets are sent every `miimon'
milliseconds. This commit allows a user to specify a custom delay
through a new option, `peer_notif_delay'.
Like for `updelay' and `downdelay', this delay should be a multiple of
`miimon' to avoid managing an additional work queue. The configuration
logic is copied from `updelay' and `downdelay'. However, the default
value cannot be set using a module parameter: Netlink or sysfs should
be used to configure this feature.
When setting `miimon' to 100 and `peer_notif_delay' to 500, we can
observe the 500 ms delay is respected:
20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
In bond_mii_monitor(), I have tried to keep the lock logic readable.
The change is due to the fact we cannot rely on a notification to
lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is
only triggered once every N times, while we need to decrement the
counter each time.
iproute2 also needs to be updated to be able to specify this new
attribute through `ip link'.
Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
hcall_vphn() is specific to pseries and will be used in a subsequent
patch. So, move it to a more appropriate place under
arch/powerpc/platforms/pseries. Also merge vphn.h into lppaca.h
and update vphn selftest to use the new files.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
perf/core has an earlier version of the x86/cpu tree merged, to avoid
conflicts, and due to this we want to pick up this ABI impacting
revert as well:
049331f277: ("x86/fsgsbase: Revert FSGSBASE support")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2019-07-03
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix the interpreter to properly handle BPF_ALU32 | BPF_ARSH
on BE architectures, from Jiong.
2) Fix several bugs in the x32 BPF JIT for handling shifts by 0,
from Luke and Xi.
3) Fix NULL pointer deref in btf_type_is_resolve_source_only(),
from Stanislav.
4) Properly handle the check that forwarding is enabled on the device
in bpf_ipv6_fib_lookup() helper code, from Anton.
5) Fix UAPI bpf_prog_info fields alignment for archs that have 16 bit
alignment such as m68k, from Baruch.
6) Fix kernel hanging in unregister_netdevice loop while unregistering
device bound to XDP socket, from Ilya.
7) Properly terminate tail update in xskq_produce_flush_desc(), from Nathan.
8) Fix broken always_inline handling in test_lwt_seg6local, from Jiri.
9) Fix bpftool to use correct argument in cgroup errors, from Jakub.
10) Fix detaching dummy prog in XDP redirect sample code, from Prashant.
11) Add Jonathan to AF_XDP reviewers, from Björn.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Selftests are reporting this failure in test_lwt_seg6local.sh:
+ ip netns exec ns2 ip -6 route add fb00::6 encap bpf in obj test_lwt_seg6local.o sec encap_srh dev veth2
Error fetching program/map!
Failed to parse eBPF program: Operation not permitted
The problem is __attribute__((always_inline)) alone is not enough to prevent
clang from inserting those functions in .text. In that case, .text is not
marked as relocateable.
See the output of objdump -h test_lwt_seg6local.o:
Idx Name Size VMA LMA File off Algn
0 .text 00003530 0000000000000000 0000000000000000 00000040 2**3
CONTENTS, ALLOC, LOAD, READONLY, CODE
This causes the iproute bpf loader to fail in bpf_fetch_prog_sec:
bpf_has_call_data returns true but bpf_fetch_prog_relo fails as there's no
relocateable .text section in the file.
To fix this, convert to 'static __always_inline'.
v2: Use 'static __always_inline' instead of 'static inline
__attribute__((always_inline))'
Fixes: c99a84eac0 ("selftests/bpf: test for seg6local End.BPF action")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The progs for bpf selftests use several different notations to force
function inlining. Standardize to what most of them use,
static __always_inline.
Suggested-by: Song Liu <liu.song.a23@gmail.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The Intel(R) Speed select technologies contains four features.
Performance profile:An non architectural mechanism that allows multiple
optimized performance profiles per system via static and/or dynamic
adjustment of core count, workload, Tjmax, and TDP, etc. aka ISS
in the documentation.
Base Frequency: Enables users to increase guaranteed base frequency on
certain cores (high priority cores) in exchange for lower base frequency
on remaining cores (low priority cores). aka PBF in the documenation.
Turbo frequency: Enables the ability to set different turbo ratio limits
to cores based on priority. aka FACT in the documentation.
Core power: An Interface that allows user to define per core/tile
priority.
There is a multi level help for commands and options. This can be used
to check required arguments for each feature and commands for the
feature.
To start navigating the features start with
$sudo intel-speed-select --help
For help on a specific feature for example
$sudo intel-speed-select perf-profile --help
To get help for a command for a feature for example
$sudo intel-speed-select perf-profile get-lock-status --help
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Len Brown <len.brown@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Based on the following report from Smatch, fix the potential NULL
pointer dereference check:
tools/lib/bpf/libbpf.c:3493
bpf_prog_load_xattr() warn: variable dereferenced before check 'attr'
(see line 3483)
3479 int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
3480 struct bpf_object **pobj, int *prog_fd)
3481 {
3482 struct bpf_object_open_attr open_attr = {
3483 .file = attr->file,
3484 .prog_type = attr->prog_type,
^^^^^^
3485 };
At the head of function, it directly access 'attr' without checking
if it's NULL pointer. This patch moves the values assignment after
validating 'attr' and 'attr->file'.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
GCC8 started emitting warning about using strncpy with number of bytes
exactly equal destination size, which is generally unsafe, as can lead
to non-zero terminated string being copied. Use IFNAMSIZ - 1 as number
of bytes to ensure name is always zero-terminated.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There are currently no tests for ALU64 shift operations when the shift
amount is 0. This adds 6 new tests to make sure they are equivalent
to a no-op. The x32 JIT had such bugs that could have been caught by
these tests.
Cc: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
syzbot reported following spat:
BUG: KASAN: use-after-free in __write_once_size include/linux/compiler.h:221
BUG: KASAN: use-after-free in hlist_del_rcu include/linux/rculist.h:455
BUG: KASAN: use-after-free in xfrm_hash_rebuild+0xa0d/0x1000 net/xfrm/xfrm_policy.c:1318
Write of size 8 at addr ffff888095e79c00 by task kworker/1:3/8066
Workqueue: events xfrm_hash_rebuild
Call Trace:
__write_once_size include/linux/compiler.h:221 [inline]
hlist_del_rcu include/linux/rculist.h:455 [inline]
xfrm_hash_rebuild+0xa0d/0x1000 net/xfrm/xfrm_policy.c:1318
process_one_work+0x814/0x1130 kernel/workqueue.c:2269
Allocated by task 8064:
__kmalloc+0x23c/0x310 mm/slab.c:3669
kzalloc include/linux/slab.h:742 [inline]
xfrm_hash_alloc+0x38/0xe0 net/xfrm/xfrm_hash.c:21
xfrm_policy_init net/xfrm/xfrm_policy.c:4036 [inline]
xfrm_net_init+0x269/0xd60 net/xfrm/xfrm_policy.c:4120
ops_init+0x336/0x420 net/core/net_namespace.c:130
setup_net+0x212/0x690 net/core/net_namespace.c:316
The faulting address is the address of the old chain head,
free'd by xfrm_hash_resize().
In xfrm_hash_rehash(), chain heads get re-initialized without
any hlist_del_rcu:
for (i = hmask; i >= 0; i--)
INIT_HLIST_HEAD(odst + i);
Then, hlist_del_rcu() gets called on the about to-be-reinserted policy
when iterating the per-net list of policies.
hlist_del_rcu() will then make chain->first be nonzero again:
static inline void __hlist_del(struct hlist_node *n)
{
struct hlist_node *next = n->next; // address of next element in list
struct hlist_node **pprev = n->pprev;// location of previous elem, this
// can point at chain->first
WRITE_ONCE(*pprev, next); // chain->first points to next elem
if (next)
next->pprev = pprev;
Then, when we walk chainlist to find insertion point, we may find a
non-empty list even though we're supposedly reinserting the first
policy to an empty chain.
To fix this first unlink all exact and inexact policies instead of
zeroing the list heads.
Add the commands equivalent to the syzbot reproducer to xfrm_policy.sh,
without fix KASAN catches the corruption as it happens, SLUB poisoning
detects it a bit later.
Reported-by: syzbot+0165480d4ef07360eeda@syzkaller.appspotmail.com
Fixes: 1548bc4e05 ("xfrm: policy: delete inexact policies from inexact list on hash rebuild")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
"git diff" says:
\ No newline at end of file
after modifying the file.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
[mpe: Rebase since addition of another test]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The 'perf kvm' command set up things so that we can record, report, top,
etc, but not 'script', so make 'perf script' be able to process samples
by allowing to pass guest kallsyms, vmlinux, modules, etc, and if at
least one of those is provided, set perf_guest to true so that guest
samples get properly resolved.
Testing it:
# perf kvm --guest --guestkallsyms /wb/rhel6.kallsyms --guestmodules /wb/rhel6.modules record -e cycles:Gk
^C[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 3.602 MB perf.data.guest (10492 samples) ]
#
# perf evlist -i perf.data.guest
cycles:Gk
# perf evlist -v -i perf.data.guest
cycles:Gk: size: 112, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD, read_format: ID, disabled: 1, inherit: 1, exclude_user: 1, exclude_hv: 1, mmap: 1, comm: 1, freq: 1, task: 1, sample_id_all: 1, exclude_host: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
#
# perf kvm --guestkallsyms /wb/rhel6.kallsyms --guestmodules /wb/rhel6.modules report --stdio -s sym | head -30
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 10K of event 'cycles:Gk'
# Event count (approx.): 2434201408
#
# Overhead Symbol
# ........ ..............................................
#
11.93% [g] avtab_search_node
3.95% [g] sidtab_context_to_sid
2.41% [g] n_tty_write
2.20% [g] _spin_unlock_irqrestore
1.37% [g] _aesni_dec4
1.33% [g] kmem_cache_alloc
1.07% [g] native_write_cr0
0.99% [g] kfree
0.95% [g] _spin_lock
0.91% [g] __memset
0.87% [g] schedule
0.83% [g] _spin_lock_irqsave
0.76% [g] __kmalloc
0.67% [g] avc_has_perm_noaudit
0.66% [g] kmem_cache_free
0.65% [g] glue_xts_crypt_128bit
0.59% [g] __d_lookup
0.59% [g] __audit_syscall_exit
0.56% [g] __memcpy
#
Then, when trying to use perf script to generate a python script and
then process the events after adding a python hook for non-tracepoint
events:
# perf script -i perf.data.guest -g python
generated Python script: perf-script.py
# vim perf-script.py
# tail -2 perf-script.py
def process_event(param_dict):
print(param_dict["symbol"])
#
# perf script -i perf.data.guest -s perf-script.py | head
in trace_begin
vmx_vmexit
vmx_vmexit
vmx_vmexit
vmx_vmexit
vmx_vmexit
vmx_vmexit
vmx_vmexit
vmx_vmexit
vmx_vmexit
231
#
We'd see just the vmx_vmexit, i.e. the samples from the guest don't show
up.
After this patch:
# perf script --guestkallsyms /wb/rhel6.kallsyms --guestmodules /wb/rhel6.modules -i perf.data.guest -s perf-script.py 2> /dev/null | head -30
in trace_begin
apic_timer_interrupt
apic_timer_interrupt
apic_timer_interrupt
apic_timer_interrupt
apic_timer_interrupt
save_args
do_timer
drain_array
inode_permission
avc_has_perm_noaudit
run_timer_softirq
apic_timer_interrupt
apic_timer_interrupt
apic_timer_interrupt
apic_timer_interrupt
apic_timer_interrupt
kvm_guest_apic_eoi_write
run_posix_cpu_timers
_spin_lock
handle_pte_fault
rcu_irq_enter
delay_tsc
delay_tsc
native_read_tsc
apic_timer_interrupt
sys_open
internal_add_timer
list_del
rcu_exit_nohz
#
Jiri Olsa noticed we need to set 'perf_guest' to true if we want to
process guest samples and I made it be set if one of the guest files
settings get set via the command line options added in this patch, that
match those present in the 'perf kvm' command.
We probably want to have 'perf record', 'perf report' etc to notice that
there are guest samples and do the right thing, which is to look for
files with some suffix that make it be associated with the guest used to
collect the samples, i.e. if a vmlinux file is passed, we can get the
build-id from it, if not some other identifier or simply looking for
"kallsyms.guest", for instance, in the current directory.
Reported-by: Mariano Pache <npache@redhat.com>
Tested-by: Mariano Pache <npache@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Yarygin <yarygin@linux.vnet.ibm.com>
Cc: Ali Raza <alirazabhutta.10@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Joe Mario <jmario@redhat.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Orran Krieger <okrieger@redhat.com>
Cc: Ramkumar Ramachandra <artagnon@gmail.com>
Cc: Yunlong Song <yunlong.song@huawei.com>
Link: https://lkml.kernel.org/n/tip-d54gj64rerlxcqsrod05biwn@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The psock_tpacket test will need to access /proc/kallsyms, this would
require the kernel config CONFIG_KALLSYMS to be enabled first.
Apart from adding CONFIG_KALLSYMS to the net/config file here, check the
file existence to determine if we can run this test will be helpful to
avoid a false-positive test result when testing it directly with the
following commad against a kernel that have CONFIG_KALLSYMS disabled:
make -C tools/testing/selftests TARGETS=net run_tests
Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Memory_BW metric generates groups including duration_time, which
maps to a software event.
For some reason this makes the group always not count.
Always put duration_time outside a group when generating metrics. It's
always the same time, so no need to group it.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/20190628220737.13259-3-andi@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Provide an internal refcounting logic if no ->ref field is provided
in the pagemap passed into devm_memremap_pages so that callers don't
have to reinvent it poorly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The dev_pagemap is a growing too many callbacks. Move them into a
separate ops structure so that they are not duplicated for multiple
instances, and an attacker can't easily overwrite them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently KVM_STATE_NESTED_EVMCS is used to signal that eVMCS
capability is enabled on vCPU.
As indicated by vmx->nested.enlightened_vmcs_enabled.
This is quite bizarre as userspace VMM should make sure to expose
same vCPU with same CPUID values in both source and destination.
In case vCPU is exposed with eVMCS support on CPUID, it is also
expected to enable KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability.
Therefore, KVM_STATE_NESTED_EVMCS is redundant.
KVM_STATE_NESTED_EVMCS is currently used on restore path
(vmx_set_nested_state()) only to enable eVMCS capability in KVM
and to signal need_vmcs12_sync such that on next VMEntry to guest
nested_sync_from_vmcs12() will be called to sync vmcs12 content
into eVMCS in guest memory.
However, because restore nested-state is rare enough, we could
have just modified vmx_set_nested_state() to always signal
need_vmcs12_sync.
From all the above, it seems that we could have just removed
the usage of KVM_STATE_NESTED_EVMCS. However, in order to preserve
backwards migration compatibility, we cannot do that.
(vmx_get_nested_state() needs to signal flag when migrating from
new kernel to old kernel).
Returning KVM_STATE_NESTED_EVMCS when just vCPU have eVMCS enabled
have a bad side-effect of userspace VMM having to send nested-state
from source to destination as part of migration stream. Even if
guest have never used eVMCS as it doesn't even run a nested
hypervisor workload. This requires destination userspace VMM and
KVM to support setting nested-state. Which make it more difficult
to migrate from new host to older host.
To avoid this, change KVM_STATE_NESTED_EVMCS to signal eVMCS is
not only enabled but also active. i.e. Guest have made some
eVMCS active via an enlightened VMEntry. i.e. vmcs12 is copied
from eVMCS and therefore should be restored into eVMCS resident
in memory (by copy_vmcs12_to_enlightened()).
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The target is to compare the performance difference (cycles diff) for
the same basic blocks in different data files.
The same basic block means same function, same start address and same
end address. This patch finds the same basic blocks from different data
files and link them together and resort by the cycles diff.
v3:
---
The block stuffs are maintained by new structure 'block_hist',
so this patch is update accordingly.
v2:
---
Since now the basic block hists is changed to per symbol,
the patch only links the basic block hists for the same
symbol in different data files.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1561713784-30533-6-git-send-email-yao.jin@linux.intel.com
[ sym->name is an array, not a pointer, so no need to check it for NULL, fixes de build in some distros ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The hist__account_cycles() can account cycles per basic block. The basic
block information is saved in cycles_hist structure.
This patch processes each symbol, get basic blocks from cycles_hist and
add the basic block entries to a new hists (in 'struct block_hist').
Using a hists is because we need to compare, sort and print the basic
blocks later.
v6:
---
Since 'ops' argument is removed from hists__add_entry_block,
update the code accordingly. No functional change.
v5:
---
Since now we still carry block_info in 'struct hist_entry'
we don't need to use our own new/free ops for hist entries.
And the block_info is released in hist_entry__delete.
v3:
---
1. In v2, we put block stuffs in 'struct hist_entry', but
it's not a good design. In v3, we create a new
'struct block_hist' and cast the 'struct hist_entry' to
'struct block_hist' in some places, which can avoid adding
new stuffs in 'struct hist_entry'.
2. abs() -> labs(), in block_cycles_diff_cmp().
v2:
---
v1 adds the basic block entries to per data-file hists
but v2 adds the basic block entries to per symbol hists.
That is to keep current perf-diff format. Will show the
result in next patches.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1561713784-30533-5-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>