Replace build_id byte array with struct build_id object and all the code
that references it.
The objective is to carry size together with build id array, so it's
better to keep both together.
This is preparatory change for following patches, and there's no
functional change.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-2-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We'll use it to ask for extra config files to be loaded, profile like
stuff that will be used first to make 'perf trace' mimic 'strace' output
via a 'perf strace' command that just sets up 'perf trace' output.
At some point it'll be used for regression tests, where we'll run some
simple commands like:
perf strace ls > perf-strace.output
strace ls > strace.output
And then do some mutable syscall arg aware diff like tool to deal with
arguments for things like mmap, that change at each execution, to be
first ignored and then properly tracked when used accoss multiple
syscalls.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Some distros don't come with python2 and only have python3 available.
This causes the "'import perf' in python" self test to fail.
This change adds python3 to the list of possible python versions
that are autodetected but maintains the priorities for
'python2' and 'python' detection. Python3 has the lowest priority.
Committer notes:
On a fedora system without python2 packages the 'perf test python'
continues to work:
# python2
bash: python2: command not found...
Similar command is: 'python'
# rpm -qa | grep python2
#
That "Similar command" gives the clue:
# rpm -qf /usr/bin/python
python-unversioned-command-3.8.5-5.fc32.noarch
# rpm -ql python-unversioned-command
/usr/bin/python
/usr/share/man/man1/python.1.gz
#
With it in place the 'python' binary is found and perf builds the python
binding using python3:
# perf test -v python
19: 'import perf' in python :
--- start ---
test child forked, pid 379988
python usage test: "echo "import sys ; sys.path.append('/tmp/build/perf/python'); import perf" | '/usr/bin/python' "
test child finished with 0
---- end ----
'import perf' in python: Ok
#
Looking at that path:
# ls -la /tmp/build/perf/python
total 1864
drwxrwxr-x. 2 acme acme 60 Oct 13 16:20 .
drwxrwxr-x. 18 acme acme 4420 Oct 13 16:28 ..
-rwxrwxr-x. 1 acme acme 1907216 Oct 13 16:28 perf.cpython-38-x86_64-linux-gnu.so
#
And:
# ldd ~/bin/perf | grep python
libpython3.8.so.1.0 => /lib64/libpython3.8.so.1.0 (0x00007f5471187000)
#
As soon as we remove it:
# rpm -e python-unversioned-command-3.8.5-5.fc32.noarch
# hash -r
# python
bash: python: command not found...
Install package 'python-unversioned-command' to provide command 'python'? [N/y] n
#
And rebuilding perf now doesn't find python in the system:
make: Entering directory '/home/acme/git/perf/tools/perf'
BUILD: Doing 'make -j24' parallel build
<SNIP>
Makefile.config:786: No python interpreter was found: disables Python support - please install python-devel/python-dev
<SNIP>
After this patch:
$ rpm -qi python-unversioned-command
package python-unversioned-command is not installed
$
$ python
bash: python: command not found...
Install package 'python-unversioned-command' to provide command 'python'? [N/y] ^C
$
$ m
make: Entering directory '/home/acme/git/perf/tools/perf'
BUILD: Doing 'make -j24' parallel build
<SNIP>
CC /tmp/build/perf/tests/attr.o
CC /tmp/build/perf/tests/python-use.o
DESCEND plugins
GEN /tmp/build/perf/python/perf.so
INSTALL trace_plugins
LD /tmp/build/perf/tests/perf-in.o
LD /tmp/build/perf/perf-in.o
LINK /tmp/build/perf/perf
<SNIP>
make: Leaving directory '/home/acme/git/perf/tools/perf'
19: 'import perf' in python : Ok
$ ldd ~/bin/perf | grep python
libpython3.8.so.1.0 => /lib64/libpython3.8.so.1.0 (0x00007f2c8c708000)
$ ls -la /tmp/build/perf/python
total 1864
drwxrwxr-x. 2 acme acme 60 Oct 13 16:20 .
drwxrwxr-x. 18 acme acme 4420 Oct 13 16:31 ..
-rwxrwxr-x. 1 acme acme 1907216 Oct 13 16:31 perf.cpython-38-x86_64-linux-gnu.so
$
Signed-off-by: James Clark <james.clark@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LPU-Reference: 20201005080645.6588-1-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
'perf trace ls' started crashing after commit d21cb73a90 on
!HAVE_SYSCALL_TABLE_SUPPORT configs (armv7l here) like this:
0 strlen () at ../sysdeps/arm/armv6t2/strlen.S:126
1 0xb6800780 in __vfprintf_internal (s=0xbeff9908, s@entry=0xbeff9900, format=0xa27160 "]: %s()", ap=..., mode_flags=<optimized out>) at vfprintf-internal.c:1688
...
5 0x0056ecdc in fprintf (__fmt=0xa27160 "]: %s()", __stream=<optimized out>) at /usr/include/bits/stdio2.h:100
6 trace__sys_exit (trace=trace@entry=0xbeffc710, evsel=evsel@entry=0xd968d0, event=<optimized out>, sample=sample@entry=0xbeffc3e8) at builtin-trace.c:2475
7 0x00566d40 in trace__handle_event (sample=0xbeffc3e8, event=<optimized out>, trace=0xbeffc710) at builtin-trace.c:3122
...
15 main (argc=2, argv=0xbefff6e8) at perf.c:538
It is because memset in trace__read_syscall_info zeroes wrong memory:
1) when initializing for the first time, it does not reset the last id.
2) in other cases, it resets the last id of previous buffer.
ad 1) it causes the crash above as sc->name used in the fprintf above
contains garbage.
ad 2) it sets nonexistent from true back to false for id 11 here. Not
sure, what the consequences are.
So fix it by introducing a special case for the initial initialization
and do the right +1 in both cases.
Fixes: d21cb73a90 ("perf trace: Grow the syscall table as needed when using libaudit")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20201001093419.15761-1-jslaby@suse.cz
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Since commit b027cc6fdf ("perf c2c: Fix 'perf c2c record -e list' to
show the default events used"), "perf c2c" tool can show the memory
events properly, it's no reason to still suggest user to use the
command "perf mem record -e list" for showing events.
This patch updates the usage for showing memory events with command
"perf c2c record -e list".
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20201011121022.22409-1-leo.yan@linaro.org
The 'perf sched latency' tool is really useful at showing worst-case
latencies that task encountered since wakeup. However it shows only the
end of the latency. Often times the start of a latency is interesting as
it can show what else was going on at the time to cause the latency. I
certainly myself spending a lot of time backtracking to the start of the
latency in "perf sched script" which wastes a lot of time.
This patch therefore adds a new column "Max delay start". Considering
this, also rename "Maximum delay at" to "Max delay end" as its easier to
understand.
Example of the new output:
----------------------------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Max delay start | Max delay end |
----------------------------------------------------------------------------------------------------------------------------------
MediaScannerSer:11936 | 651.296 ms | 67978 | avg: 0.113 ms | max: 77.250 ms | max start: 477.691360 s | max end: 477.768610 s
audio@2.0-servi:(3) | 0.000 ms | 3440 | avg: 0.034 ms | max: 72.267 ms | max start: 477.697051 s | max end: 477.769318 s
AudioOut_1D:8112 | 0.000 ms | 2588 | avg: 0.083 ms | max: 64.020 ms | max start: 477.710740 s | max end: 477.774760 s
Time-limited te:14973 | 7966.090 ms | 24807 | avg: 0.073 ms | max: 15.563 ms | max start: 477.162746 s | max end: 477.178309 s
surfaceflinger:8049 | 9.680 ms | 603 | avg: 0.063 ms | max: 13.275 ms | max start: 476.931791 s | max end: 476.945067 s
HeapTaskDaemon:(3) | 1588.830 ms | 7040 | avg: 0.065 ms | max: 6.880 ms | max start: 473.666043 s | max end: 473.672922 s
mount-passthrou:(3) | 1370.809 ms | 68904 | avg: 0.011 ms | max: 6.524 ms | max start: 478.090630 s | max end: 478.097154 s
ReferenceQueueD:(3) | 11.794 ms | 1725 | avg: 0.014 ms | max: 6.521 ms | max start: 476.119782 s | max end: 476.126303 s
writer:14077 | 18.410 ms | 1427 | avg: 0.036 ms | max: 6.131 ms | max start: 474.169675 s | max end: 474.175805 s
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20200925235634.4089867-1-joel@joelfernandes.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
For comparison, it now runs the benchmark twice - one if regular -b and
another for --buildid-all.
$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 21.002 msec (+- 0.172 msec)
Average time per event: 2.059 usec (+- 0.017 usec)
Average memory usage: 8169 KB (+- 0 KB)
Average build-id-all injection took: 19.543 msec (+- 0.124 msec)
Average time per event: 1.916 usec (+- 0.012 usec)
Average memory usage: 7348 KB (+- 0 KB)
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201012070214.2074921-7-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
It was reported that 'perf stat' crashed when using with armv8_pmu (CPU)
events with the task mode. As 'perf stat' uses an empty cpu map for
task mode but armv8_pmu has its own cpu mask, it has confused which map
it should use when accessing file descriptors and this causes segfaults:
(gdb) bt
#0 0x0000000000603fc8 in perf_evsel__close_fd_cpu (evsel=<optimized out>,
cpu=<optimized out>) at evsel.c:122
#1 perf_evsel__close_cpu (evsel=evsel@entry=0x716e950, cpu=7) at evsel.c:156
#2 0x00000000004d4718 in evlist__close (evlist=0x70a7cb0) at util/evlist.c:1242
#3 0x0000000000453404 in __run_perf_stat (argc=3, argc@entry=1, argv=0x30,
argv@entry=0xfffffaea2f90, run_idx=119, run_idx@entry=1701998435)
at builtin-stat.c:929
#4 0x0000000000455058 in run_perf_stat (run_idx=1701998435, argv=0xfffffaea2f90,
argc=1) at builtin-stat.c:947
#5 cmd_stat (argc=1, argv=0xfffffaea2f90) at builtin-stat.c:2357
#6 0x00000000004bb888 in run_builtin (p=p@entry=0x9764b8 <commands+288>,
argc=argc@entry=4, argv=argv@entry=0xfffffaea2f90) at perf.c:312
#7 0x00000000004bbb54 in handle_internal_command (argc=argc@entry=4,
argv=argv@entry=0xfffffaea2f90) at perf.c:364
#8 0x0000000000435378 in run_argv (argcp=<synthetic pointer>,
argv=<synthetic pointer>) at perf.c:408
#9 main (argc=4, argv=0xfffffaea2f90) at perf.c:538
To fix this, I simply used the given cpu map unless the evsel actually
is not a system-wide event (like uncore events).
Fixes: 7736627b86 ("perf stat: Use affinity for closing file descriptors")
Reported-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Barry Song <song.bao.hua@hisilicon.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20201007081311.1831003-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
So that in older systems we get it in the mmap flags scnprintf routines:
$ tools/perf/trace/beauty/mmap_flags.sh | head -9 2> /dev/null
static const char *mmap_flags[] = {
[ilog2(0x40) + 1] = "32BIT",
#ifndef MAP_32BIT
#define MAP_32BIT 0x40
#endif
[ilog2(0x01) + 1] = "SHARED",
#ifndef MAP_SHARED
#define MAP_SHARED 0x01
#endif
$
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
It'll also conditionally generate the defines, so that if we don't have
those when building a new tool tarball in an older systems, we get
those, and we need them sometimes in the actual scnprintf routine, such
as when checking if a flags means we have an extra arg, like with
MREMAP_FIXED.
$ tools/perf/trace/beauty/mremap_flags.sh
static const char *mremap_flags[] = {
[ilog2(1) + 1] = "MAYMOVE",
#ifndef MREMAP_MAYMOVE
#define MREMAP_MAYMOVE 1
#endif
[ilog2(2) + 1] = "FIXED",
#ifndef MREMAP_FIXED
#define MREMAP_FIXED 2
#endif
[ilog2(4) + 1] = "DONTUNMAP",
#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4
#endif
};
$
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Some headers are not used in building the tools directly, but instead to
generate tables that then gets source code included to do id->string and
string->id lookups for things like syscall flags and commands.
We were adding it directly to tools/include/ and this sometimes gets in
the way of building using system headers, lets untangle this a bit.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- NFSv4.2: copy_file_range needs to invalidate caches on success
- NFSv4.2: Fix security label length not being reset
- pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly
on read
- pNFS/flexfiles: Fix signed/unsigned type issues with mirror
indices"
* tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pNFS/flexfiles: Be consistent about mirror index types
pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
NFSv4.2: fix client's attribute cache management for copy_file_range
nfs: Fix security label length not being reset
It seems likely this block was pasted from internal_get_user_pages_fast,
which is not passed an mm struct and therefore uses current's. But
__get_user_pages_locked is passed an explicit mm, and current->mm is not
always valid. This was hit when being called from i915, which uses:
pin_user_pages_remote->
__get_user_pages_remote->
__gup_longterm_locked->
__get_user_pages_locked
Before, this would lead to an OOPS:
BUG: kernel NULL pointer dereference, address: 0000000000000064
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S U O 5.9.0-rc7+ #140
Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
RIP: 0010:__get_user_pages_remote+0xd7/0x310
Call Trace:
__i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
process_one_work+0x1ca/0x390
worker_thread+0x48/0x3c0
kthread+0x114/0x130
ret_from_fork+0x1f/0x30
CR2: 0000000000000064
This commit fixes the problem by using the mm pointer passed to the
function rather than the bogus one in current.
Fixes: 008cfe4418 ("mm: Introduce mm_struct.has_pinned")
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-by: Harald Arnesen <harald@skogtun.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ensure 'st' is initialized before an error branch is taken.
Fixes test "67: Parse and process metrics" with LLVM msan:
==6757==WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x5570edae947d in rblist__exit tools/perf/util/rblist.c:114:2
#1 0x5570edb1c6e8 in runtime_stat__exit tools/perf/util/stat-shadow.c:141:2
#2 0x5570ed92cfae in __compute_metric tools/perf/tests/parse-metric.c:187:2
#3 0x5570ed92cb74 in compute_metric tools/perf/tests/parse-metric.c:196:9
#4 0x5570ed92c6d8 in test_recursion_fail tools/perf/tests/parse-metric.c:318:2
#5 0x5570ed92b8c8 in test__parse_metric tools/perf/tests/parse-metric.c:356:2
#6 0x5570ed8de8c1 in run_test tools/perf/tests/builtin-test.c:410:9
#7 0x5570ed8ddadf in test_and_print tools/perf/tests/builtin-test.c:440:9
#8 0x5570ed8dca04 in __cmd_test tools/perf/tests/builtin-test.c:661:4
#9 0x5570ed8dbc07 in cmd_test tools/perf/tests/builtin-test.c:807:9
#10 0x5570ed7326cc in run_builtin tools/perf/perf.c:313:11
#11 0x5570ed731639 in handle_internal_command tools/perf/perf.c:365:8
#12 0x5570ed7323cd in run_argv tools/perf/perf.c:409:2
#13 0x5570ed731076 in main tools/perf/perf.c:539:3
Fixes: commit f5a56570a3 ("perf test: Fix memory leaks in parse-metric test")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: clang-built-linux@googlegroups.com
Link: http://lore.kernel.org/lkml/20200923210655.4143682-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The --for-each-cgroup option is a syntax sugar to monitor large number
of cgroups easily. Current command line requires to list all the events
and cgroups even if users want to monitor same events for each cgroup.
This patch addresses that usage by copying given events for each cgroup
on user's behalf.
For instance, if they want to monitor 6 events for 200 cgroups each they
should write 1200 event names (with -e) AND 1200 cgroup names (with -G)
on the command line. But with this change, they can just specify 6
events and 200 cgroups with a new option.
A simpler example below: It wants to measure 3 events for 2 cgroups ('A'
and 'B'). The result is that total 6 events are counted like below.
$ perf stat -a -e cpu-clock,cycles,instructions --for-each-cgroup A,B sleep 1
Performance counter stats for 'system wide':
988.18 msec cpu-clock A # 0.987 CPUs utilized
3,153,761,702 cycles A # 3.200 GHz (100.00%)
8,067,769,847 instructions A # 2.57 insn per cycle (100.00%)
982.71 msec cpu-clock B # 0.982 CPUs utilized
3,136,093,298 cycles B # 3.182 GHz (99.99%)
8,109,619,327 instructions B # 2.58 insn per cycle (99.99%)
1.001228054 seconds time elapsed
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200924124455.336326-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull Kbuild fixes from Masahiro Yamada:
- ignore compiler stubs for PPC to fix builds
- fix the usage of --target mentioned in the LLVM document
* tag 'kbuild-fixes-v5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
Documentation/llvm: Fix clang target examples
scripts/kallsyms: skip ppc compiler stub *.long_branch.* / *.plt_branch.*
Pull x86 fixes from Thomas Gleixner:
"Two fixes for the x86 interrupt code:
- Unbreak the magic 'search the timer interrupt' logic in IO/APIC
code which got wreckaged when the core interrupt code made the
state tracking logic stricter.
That caused the interrupt line to stay masked after switching from
IO/APIC to PIC delivery mode, which obviously prevents interrupts
from being delivered.
- Make run_on_irqstack_code() typesafe. The function argument is a
void pointer which is then cast to 'void (*fun)(void *).
This breaks Control Flow Integrity checking in clang. Use proper
helper functions for the three variants reuqired"
* tag 'x86-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioapic: Unbreak check_timer()
x86/irq: Make run_on_irqstack_cond() typesafe
Pull timer updates from Thomas Gleixner:
"A set of clocksource/clockevents updates:
- Reset the TI/DM timer before enabling it instead of doing it the
other way round.
- Initialize the reload value for the GX6605s timer correctly so the
hardware counter starts at 0 again after overrun.
- Make error return value negative in the h8300 timer init function"
* tag 'timers-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/timer-gx6605s: Fixup counter reload
clocksource/drivers/timer-ti-dm: Do reset before enable
clocksource/drivers/h8300_timer8: Fix wrong return value in h8300_8timer_init()
Pinned pages shouldn't be write-protected when fork() happens, because
follow up copy-on-write on these pages could cause the pinned pages to
be replaced by random newly allocated pages.
For huge PMDs, we split the huge pmd if pinning is detected. So that
future handling will be done by the PTE level (with our latest changes,
each of the small pages will be copied). We can achieve this by let
copy_huge_pmd() return -EAGAIN for pinned pages, so that we'll
fallthrough in copy_pmd_range() and finally land the next
copy_pte_range() call.
Huge PUDs will be even more special - so far it does not support
anonymous pages. But it can actually be done the same as the huge PMDs
even if the split huge PUDs means to erase the PUD entries. It'll
guarantee the follow up fault ins will remap the same pages in either
parent/child later.
This might not be the most efficient way, but it should be easy and
clean enough. It should be fine, since we're tackling with a very rare
case just to make sure userspaces that pinned some thps will still work
even without MADV_DONTFORK and after they fork()ed.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows copy_pte_range() to do early cow if the pages were pinned on
the source mm.
Currently we don't have an accurate way to know whether a page is pinned
or not. The only thing we have is page_maybe_dma_pinned(). However
that's good enough for now. Especially, with the newly added
mm->has_pinned flag to make sure we won't affect processes that never
pinned any pages.
It would be easier if we can do GFP_KERNEL allocation within
copy_one_pte(). Unluckily, we can't because we're with the page table
locks held for both the parent and child processes. So the page
allocation needs to be done outside copy_one_pte().
Some trick is there in copy_present_pte(), majorly the wrprotect trick
to block concurrent fast-gup. Comments in the function should explain
better in place.
Oleg Nesterov reported a (probably harmless) bug during review that we
didn't reset entry.val properly in copy_pte_range() so that potentially
there's chance to call add_swap_count_continuation() multiple times on
the same swp entry. However that should be harmless since even if it
happens, the same function (add_swap_count_continuation()) will return
directly noticing that there're enough space for the swp counter. So
instead of a standalone stable patch, it is touched up in this patch
directly.
Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This prepares for the future work to trigger early cow on pinned pages
during fork().
No functional change intended.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(Commit message majorly collected from Jason Gunthorpe)
Reduce the chance of false positive from page_maybe_dma_pinned() by
keeping track if the mm_struct has ever been used with pin_user_pages().
This allows cases that might drive up the page ref_count to avoid any
penalty from handling dma_pinned pages.
Future work is planned, to provide a more sophisticated solution, likely
to turn it into a real counter. For now, make it atomic_t but use it as
a boolean for simplicity.
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull clocksource/clockevent fixes from Daniel Lezcano:
- Fix wrong signed return value when checking of_iomap in the probe
function for the h8300 timer (Tianjia Zhang)
- Fix reset sequence when setting up the timer on the dm_timer (Tony
Lindgren)
- Fix counter reload when the interrupt fires on gx6605s (Guo Ren)
Pull SCSI fixes from James Bottomley:
"Three fixes: one in drivers (lpfc) and two for zoned block devices.
The latter also impinges on the block layer but only to introduce a
new block API for setting the zone model rather than fiddling with the
queue directly in the zoned block driver"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: sd_zbc: Fix ZBC disk initialization
scsi: sd: sd_zbc: Fix handling of host-aware ZBC disks
scsi: lpfc: Fix initial FLOGI failure due to BBSCN not supported
Pull io_uring fixes from Jens Axboe:
"Two fixes for regressions in this cycle, and one that goes to 5.8
stable:
- fix leak of getname() retrieved filename
- remove plug->nowait assignment, fixing a regression with btrfs
- fix for async buffered retry"
* tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
io_uring: ensure async buffered read-retry is setup properly
io_uring: don't unconditionally set plug->nowait = true
io_uring: ensure open/openat2 name is cleaned on cancelation
Pull block fixes from Jens Axboe:
"NVMe pull request from Christoph, and removal of a dead define.
- fix error during controller probe that cause double free irqs
(Keith Busch)
- FC connection establishment fix (James Smart)
- properly handle completions for invalid tags (Xianting Tian)
- pass the correct nsid to the command effects and supported log
(Chaitanya Kulkarni)"
* tag 'block-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
block: remove unused BLK_QC_T_EAGAIN flag
nvme-core: don't use NVME_NSID_ALL for command effects and supported log
nvme-fc: fail new connections to a deleted host or remote port
nvme-pci: fix NULL req in completion handler
nvme: return errors for hwmon init
Merge misc fixes from Andrew Morton:
"9 patches.
Subsystems affected by this patch series: mm (thp, memcg, gup,
migration, memory-hotplug), lib, and x86"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: don't rely on system state to detect hot-plug operations
mm: replace memmap_context by meminit_context
arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback
lib/memregion.c: include memregion.h
lib/string.c: implement stpcpy
mm/migrate: correct thp migration stats
mm/gup: fix gup_fast with dynamic page table folding
mm: memcontrol: fix missing suffix of workingset_restore
mm, THP, swap: fix allocating cluster for swapfile by mistake